Preface


This book provides a calculus-based introduction to probability and statistics. It con- 
tains enough material for two semesters but, with judicious selection, it can be used 
as a textbook for a one-semester course, either in probability and statistics or in prob- 
ability alone. 

Each section contains many examples and exercises and, in the statistical sec- 
tions, examples taken from current research journals. 

The discussion is rigorous, with carefully motivated definitions, theorems and
proofs, but is aimed at an audience, such as computer science students, whose
mathematical background is not very strong and who do not need the detail and
mathematical depth of similar books written for mathematics or statistics majors.

The use of linear algebra is avoided and the use of multivariable calculus is min- 
imized as much as possible. The few concepts from the latter, like double integrals, 
that were unavoidable, are explained in an informal manner, but triple or higher inte- 
grals are not used. The reader may find a few brief references to other more advanced 
concepts, but they can safely be ignored. 


Some distinctive features 


In Chapter 1, events are defined (following Kemeny and Snell, Finite Mathematics) 
as truth-sets of statements. Venn diagrams are presented with numbered rather than 
shaded regions, making references to those regions much easier. 

In Chapter 2, combinatorial principles involving all four arithmetic operations 
are mentioned, not just multiplication as in most books. Tree diagrams are empha- 
sized. The oft-repeated mistake of presenting a limited version of the multiplication 
principle, in which the selections are from the same set in every stage, and which 
makes it unsuitable for counting permutations, is avoided. 

In Chapter 3, the axioms of probabilities are motivated by a brief discussion of 
relative frequency and, in the interest of correctness, measure-theoretical concepts 
are mentioned, though not explained. 

In the combinatorial calculation of probabilities, evaluations with both ordered
and unordered selections are given where possible.

De Méré’s first paradox is carefully explained (in contrast to many books where
it is mishandled).

Independence is defined before conditioning and is returned to in the context 
of conditional probabilities. Both concepts are illustrated by simple examples be- 
fore stating the general definitions and more elaborate and interesting applications. 
Among the latter are a simple version of the gambler’s ruin problem and Laplace’s
rule of succession as he applied it to computing the chances of the sun’s rising the
next day.

In Chapter 4, random variables are defined as functions on a sample space, 
and first, discrete ones are discussed through several examples, including the basic, 
named varieties. 

The relationship between probability functions and distribution functions is
stressed, and the properties of the latter are stated in a theorem, whose proof,
though, is relegated to exercises with hints.

Histograms for probability functions are introduced as a vehicle for transition- 
ing to density functions in the continuous case. The uniform and the exponential 
distribution are introduced next. 

A section is then devoted to obtaining the distributions of functions of random 
variables, with several theorems of increasing complexity and nine detailed exam- 
ples. 

The next section deals with joint distributions, especially in two dimensions. The 
uniform distribution on various regions is explored and some simple double integrals 
are explained and evaluated. The notation f(x, y) is used for the joint p.f. or density 
and f_X(x) and f_Y(y) for the marginals. This notation may be somewhat clumsy, but
is much easier to remember than using different letters for the three functions, as is 
done in many books. 

Section 4.5 deals with independence of random variables, mainly in two dimen- 
sions. Several theorems are given and some geometric examples are discussed. 

In the last section of the chapter, conditional distributions are treated, both for 
discrete and continuous random variables. Again, the notation f_{X|Y}(x, y) is preferred
over others that are widely used but less transparent. 

In Chapter 5, expectation and its ramifications are discussed. The St. Petersburg 
paradox is explained in more detail than in most books, and the gambler’s ruin prob- 
lem is revisited using generating functions. 

In the section on covariance and correlation, following the basic material, the 
Schwarz inequality is proved and the regression line in scatter plots is discussed. 

In the last section of the chapter, medians and quantiles are discussed. 

In Chapter 6, the first section deals with the Poisson distribution and the Poisson 
process. The latter is not deduced from basic principles, because that would not be 
of interest to the intended audience, but is defined just by the distribution formula. 
Its various properties are derived though. 

In Section 6.2, the normal distribution is discussed in detail, with proofs for its 
basic properties. 

In the next section, the deMoivre–Laplace limit theorem is proved, and then used
to prove the continuity correction to the normal approximation of the binomial,
followed by two examples, one of them in a statistical setting. An outline of Lindeberg’s
proof of the central limit theorem is given, followed by a couple of statistical
examples of its use.

In Section 6.4, the negative binomial, the gamma and beta random variables are 
introduced in a standard manner. 

The last section of the chapter treats the bivariate normal distribution in a novel 
manner, which is rigorous, yet simple and avoids complicated integrals and linear 
algebra. Multivariate normal distributions are just briefly described. 

Chapter 7 deals with basic statistical issues. Section 7.1 begins with the method 
of maximum likelihood, which is then used to derive estimators in various settings. 
The method of moments for constructing estimators is also discussed. Confidence 
intervals for means of normal distributions are also introduced here. 

Section 7.2 introduces the concepts of hypothesis testing, and is then continued 
in the next section with a discussion of the power function. 

In Section 7.4, the special statistical methods for normal populations are treated. 
The proof of the independence of the sample mean and variance, and of the distri- 
bution of the sample variance is in part original. It was devised to avoid methods of 
linear algebra. Sections 7.5, 7.6 and 7.7 describe chi-square tests, two-sample tests 
and Kolmogorov–Smirnov tests.


Géza Schay 
University of Massachusetts, Boston 
May 2007 


Introduction


Probability theory is a branch of mathematics that deals with repetitive events whose 
occurrence or nonoccurrence is subject to chance variation. Statistics is a related 
scientific discipline concerned with the gathering, representation and interpretation 
of data, and with methods for drawing inferences from them. 

While the preceding statements are necessarily quite vague at this point, their 
meaning will be made precise and elaborated in the text. Here we shed some light on 
them by a few examples. 

Suppose we toss a coin, and observe whether it lands head (H) or tail (T) up.
While the outcome may or may not be completely determined by the laws of physics 
and the conditions of the toss (such as the initial position of the coin in the tosser’s 
hand, the kind of flick given to the coin, the wind, the properties of the surface on 
which the coin lands, etc.), and since these conditions are usually not known anyway, 
we cannot be sure on which side the coin will fall. We usually assign the number 1/2 
as the probability of either result. This can be interpreted and justified in several 
ways. First, it is a convention that we take the numbers from 0 to 1 as probability
values, and the total probability for all the outcomes of an experiment to be 1. (We 
could use any other scale instead. For instance, when probabilities are expressed as 
percentages, we use the numbers from 0 to 100, and when we speak of odds we use a 
scale from 0 to infinity.) Hence, the essential part of the probability assignment 1/2 to 
both H and T is the equality of the probabilities of the two outcomes. Some people 
have explained this equality by a “principle of insufficient reason,” that is, that the 
two probabilities should be equal because we have no reason to favor one outcome 
over the other, especially in view of the symmetrical shape of the coin. This rea- 
soning does not stand up well in more complicated experiments. For instance in the 
eighteenth century several eminent mathematicians believed that in the tossing of two 
coins there are three equally likely outcomes, HH, HT, and TT, each of which should 
have probability 1/3. It was only through experimentation that people observed that 
when one coin shows H and the other T, then it makes a difference which coin shows
which outcome, that is, that the four outcomes, HH, HT, TH, and TT, each show up
about one fourth of the time, and so each should be assigned probability 1/4. It is 
interesting to note, however, that in modern physics, for elementary particles exactly 
the opposite situation holds, that is, they are very strangely indistinguishable from
each other. Also, the laws of quantum theory directly give probabilities for the out- 
comes of measurements of various physical quantities, unlike the laws of classical 
physics, which predict the outcomes themselves. 
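
The observation about two coins can be checked by a quick simulation. The following Python sketch is only an illustration; the function name, the seed and the number of trials are arbitrary choices. It tosses two distinguishable coins many times and tabulates the relative frequencies of the four ordered outcomes.

```python
import random
from collections import Counter

def two_coin_frequencies(trials=100_000, seed=1):
    """Toss two distinguishable fair coins `trials` times and return
    the relative frequency of each ordered outcome HH, HT, TH, TT."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    counts = Counter(
        rng.choice("HT") + rng.choice("HT") for _ in range(trials)
    )
    return {outcome: counts[outcome] / trials
            for outcome in ("HH", "HT", "TH", "TT")}

freqs = two_coin_frequencies()
# "One head and one tail" lumps HT and TH together, so it shows up
# about half of the time, not one third of the time.
mixed = freqs["HT"] + freqs["TH"]
print(freqs, mixed)
```

Each of the four frequencies should come out close to 1/4, while the combined frequency of HT and TH should be close to 1/2.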

The coin tossing examples above illustrate the generally accepted form of the 
frequency interpretation of probabilities: we assign probability values to the possible 
outcomes of an experiment so as to reflect the proportions of the occurrence of each 
outcome in a large number of repetitions of the experiment. Due to this frequency 
interpretation, probability assignments and computations must follow certain simple 
rules, which are taken as axioms of the theory. The commonly used form of probability
theory, which we present here, is based on this axiomatic approach. (There exist
other approaches and interpretations of probability, but we will not discuss these 
here. They are mostly incomplete and unsettled.) In this theory we are not concerned 
with the justification of probability assignments. We make them in some manner that 
corresponds to our experience, and we use probability theory only to compute other 
probabilities and related quantities. On the other hand, in the theory of statistics we 
are very much concerned, among other things, with the determination of probabilities 
from repetitions of experiments. 

An example of the kind of problem probability theory can answer is the follow- 
ing: Suppose we have a fair coin, that is, one that has probability 1/2 for showing 
H and 1/2 for T, and we toss it many times. I have 10 dollars and bet one dollar on
each toss, playing against an infinitely rich adversary. What is the probability that I 
would lose all of my money within, say, 20 tosses? (About 0.026.) Or, to ask for a 
quantity that is not a probability: For how many tosses can I expect my $10 to last? 
(Infinitely many.) Similarly: How long can we expect a waiting line to grow, whether 
it involves people in a store or data in a computer? How long can a typical customer 
expect to wait? 
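
The figure of about 0.026 quoted for the betting game can be reproduced by following the distribution of the player's capital toss by toss. The sketch below is one way to do this in Python, under the assumptions stated in the text (a fair coin, one-dollar bets, an opponent who can never be ruined); the function name `ruin_probability` is our own.

```python
from fractions import Fraction

def ruin_probability(capital, tosses):
    """Probability of losing all of `capital` dollars within `tosses`
    fair one-dollar bets against an infinitely rich opponent."""
    half = Fraction(1, 2)
    # dist[k] = probability of holding k dollars now without having been ruined
    dist = {capital: Fraction(1)}
    ruined = Fraction(0)
    for _ in range(tosses):
        new_dist = {}
        for k, prob in dist.items():
            for nxt in (k - 1, k + 1):      # lose or win one dollar
                if nxt == 0:
                    ruined += prob * half   # ruin is final: stop tracking
                else:
                    new_dist[nxt] = new_dist.get(nxt, Fraction(0)) + prob * half
        dist = new_dist
    return ruined

p20 = ruin_probability(10, 20)
print(float(p20))  # roughly 0.0266
```

Exact fractions are used so that no rounding error enters the result. The second answer in the text (an infinite expected duration) cannot be obtained by such a finite computation and is not attempted here.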

Examples of the kinds of problems that statistical theory can answer are the fol- 
lowing: Suppose I am playing the above game with a coin supplied by my opponent, 
and I suspect that he has doctored it, that is, the probabilities of H and T are not 
equal. How many times do we have to toss to find out with reasonable certainty 
whether the coin is fair or unfair? What are reasonable assignments of the probabili- 
ties of H and T? Or in a different context: How many people need to be sampled in a 
preelection poll to predict the outcome with a certain degree of confidence? (Surpris- 
ingly, a sample of a few hundred people is usually enough, even though the election 
may involve millions.) How much confidence can we have in the effectiveness of a 
drug tested on a certain number of people? How do we conduct such tests? 

Probability theory originated in the sixteenth century in problems of gambling, 
and even today most people encounter it, if at all, only in that context. In this book 
we too shall frequently use gambling problems as illustrations, because of their rich 
history and because they can generally be described more simply than most other 
types of problems. Nevertheless we shall not lose sight of the fact that probability 
and statistics are used in many fields, such as insurance, public opinion polls, medical 
experiments, computer science, etc., and we shall present a wide-ranging set of real 
life applications as well. 


The Algebra of Events


1.1 Sample Spaces, Statements, Events 


Before discussing probabilities, we must discuss the kinds of events whose probabil- 
ities we want to consider, make their meaning precise, and study various operations 
with them. 

The events to be considered can be described by such statements as “a toss of a
given coin results in head,” “a card drawn at random from a regular 52 card deck is
an Ace,” or “this book is green.”

What are the common characteristics of these examples? 

First, associated with each statement there is a set S of possibilities, or possible 
outcomes. 


Example 1.1.1 (Tossing a Coin). For a coin toss, S may be taken to consist of two 
possible outcomes, which we may abbreviate as H and T for head and tail. We say 
that H and T are the members, elements or points of S, and write¹ S = {H, T}.
Another choice might be S = {HH, HT, TH, TT}, where we toss two coins, but
ignore one of them. In this case, for instance, the outcome “the first coin shows H” 
is represented by the set {HH, HT}, that is, this statement is true if we obtain HH or 
HT and false if we obtain TH or TT. 


Example 1.1.2 (Drawing a Card). For the drawing of a card from a 52 card deck, we
can see a wide range of choices for S, depending on how much detail we want for
the description of the possible outcomes. Thus, we may take S to be the set {A, Ā},
where A stands for Ace and Ā for non-Ace. Or we may take S to be a set of 52
elements, each corresponding to the choice of a different card. Another choice might
be S = {S, H, D, C}, where the letters stand for the suit of the card: spade, heart,
diamond, club. Not every statement about drawing a card can be represented in every
one of these sample spaces. For example, the statement “an Ace is drawn” cannot be
represented in the last sample space, but it corresponds to the simple set {A} in the
sample space {A, Ā}.


1 Recall that the usual notation for a set is a list of its members between braces, with the
members separated by commas. More about this in the next section.


Example 1.1.3 (Color of a Book). In this example S may be taken as the set {G, Ḡ},
where G stands for green, and Ḡ for not green. Or S may be the set {G, R, B, O},
where the letters stand for green, red, blue, and other. Another choice for S may be
{LG, DG, Ḡ}, where the letters stand for light green, dark green, and not green.


Example 1.1.4 (Tossing a Coin Until an H is Obtained). If we toss a coin until an H
is obtained, we cannot say in advance how many tosses will be required, and so
the natural sample space is S = {H, TH, TTH, TTTH, ...}, an infinite set. We can,
of course, use many other sample spaces as well; for instance, we may be interested
only in whether we had to toss the coin more than twice or not, in which case S = {1
or 2, more than 2} is adequate.


Example 1.1.5 (Selecting a Number from an Interval). Sometimes, we need an
uncountable set for a sample space. For instance, if the experiment consists of choosing
a random number between 0 and 1, we may use S = {x : 0 < x < 1}.


As can be seen from these examples, many choices for S are possible in each
case, in fact infinitely many. This may seem confusing, but we must put every
statement into some context, and while we have a choice over the context, we must make
it definite; that is, we must specify a single set S whenever we want to assign proba- 
bilities. It would be very difficult to speak of the probability of an event if we did not 
know the alternatives. 

The set S that consists of all the possible outcomes of an experiment is called the 
universal set or the sample space of the experiment. (The word “universal” refers 
to the fact that S is the largest set we want to consider in connection with the ex- 
periment; “sample” refers to the fact that in many applications the outcomes are 
statistical samples; and the word “space” is used in mathematics for certain types of 
sets.) The members of S are called the possible outcomes of the experiment or the 
(sample) points or elements of S. 

The second common characteristic of the examples is that the statements are 
expressed as declarative sentences, which are true (t) for some of the possible out- 
comes and false (f) for the others. For any given sample space we do not want to 
consider statements whose truth or falsehood cannot be determined for each possible 
outcome, or conversely, once a statement is given, we must choose our sample space 
so that the statement will be t or f for each point.

For instance, the statement p = “an Ace is drawn” is t for A and f for Ā, if the
first sample space of Example 1.1.2 is used. If we choose the more detailed sample
space of 52 elements, then p is t for the four sample points AS, AH, AD, and AC
(these stand for the drawings of the Ace of spades, hearts, diamonds, and clubs, 
respectively), and p is f for the other 48 possible outcomes. On the other hand, the 
sample space {black, red} is not suitable if we want to consider this statement, since 
we cannot determine whether p is true or false if all we know is whether the card 
drawn is black or red. 

All this can be summarized as follows:
We consider experiments that are described by 


1. The sample space, i.e., the set of possible outcomes, 

2. A statement or several statements, which are true for certain outcomes in S and 
false for others. Such a statement is in effect a function from the set S to the
two-element set {t, f}, that is, an assignment of t to the outcomes for which the
given statement is true and f to the outcomes for which the statement is false. 


Any performance of such an experiment results in one and only one point of S. 
Once the experiment has been performed, we can determine whether any given
statements are t or f for this point. Thus, given S, the experiment we consider consists
of selecting one point of the set S, and we perform it only once. If we want to model 
repetitions, then we make a single selection from a new sample space whose points 
represent the possible outcomes of the repetitions. For example, to model two tosses 
of a coin, we may use the sample space S = {HH, HT, TH, TT} where the experi- 
ment consists of selecting exactly one of the four points HH, HT, TH, or TT, and we 
do this selection only once. 
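
The sample space for repeated tosses can be generated mechanically. As a small illustration (the helper name `toss_space` is hypothetical), Python's `itertools.product` lists every sequence of H's and T's of a given length; each sequence is one point of the new sample space, and the experiment selects exactly one of them.

```python
from itertools import product

def toss_space(n):
    """Sample space for n tosses of a coin: all strings of H and T of length n."""
    return {"".join(seq) for seq in product("HT", repeat=n)}

S2 = toss_space(2)
print(sorted(S2))           # the four points HH, HT, TH, TT
print(len(toss_space(3)))   # 2**3 points for three tosses
```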

The set of sample points for which a statement p is t is called the truth-set of p, or
the event described by, or corresponding to, p. For example, the event corresponding
to the statement p = “an Ace is drawn” is the set P = {AS, AH, AD, AC} if the 52
element sample space is used. Thus, we use the word “event” to describe a subset²
of the sample space. Actually, if S is a finite set, then we consider every subset of S
to be an event. (If S is infinite, some subsets may have to be excluded.) For example,
if S = {LG, DG, Ḡ} is the sample space for the color of a book, then the event
P = {LG, DG} corresponds to the statement p = “the book is green,” and the event
Q = {DG, Ḡ} corresponds to q = “the book is dark green or not green” = “the book
is not light green.” Incidentally, this example also shows that a statement can usually
be phrased in several equivalent forms.
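
Since a statement is in effect a function from S to {t, f}, its truth-set can be computed by testing every sample point. The following Python sketch illustrates this for the 52-element sample space; the card encoding (rank followed by suit letter, as in “AS”) and the names `deck`, `truth_set` and `is_ace` are assumptions made for the illustration.

```python
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["S", "H", "D", "C"]  # spades, hearts, diamonds, clubs

# The 52-element sample space: one point per card.
deck = {rank + suit for rank in RANKS for suit in SUITS}

def truth_set(statement, sample_space):
    """Event described by `statement`: the points for which it is true."""
    return {point for point in sample_space if statement(point)}

def is_ace(card):
    """The statement p = "an Ace is drawn", evaluated at one sample point."""
    return card.startswith("A")

P = truth_set(is_ace, deck)
print(sorted(P))
```

The same `truth_set` function works for any statement that can be decided at every point of the chosen sample space, which is exactly the requirement placed on statements in this section.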

We say that an event P occurs, if in a performance of the experiment the state- 
ment p corresponding to P turns out to be true. 

Warning: As can be seen from the preceding discussion, when we make a
statement such as p = “a card drawn is an Ace,” we do not imply that this is necessarily
true, as is generally meant for statements in ordinary usage. Also, we must carefully
distinguish the statement p from the statement q = “p is true.” In fact, even the latter
statement may be false. Furthermore, we could have an infinite hierarchy of different
statements based on this p. The next two would be: r = “q is true” and s = “r is
true.”

In closing this section, let us mention that the events that consist of a single
sample point are called elementary events or simple events. For instance {LG}, {DG}, {Ḡ}
are the elementary events in the sample space {LG, DG, Ḡ}. (The point LG and the
set {LG} are conceptually distinct, somewhat as the person who is the president is
conceptually different from his role as president. More on this in the next section.)


2 Recall that a set A is said to be a subset of a set B if every element of A is also an element 
of B. 


Exercise 1.1.1. A coin is tossed twice. A sample space S can be described in an
obvious manner as {HH, HT, TH, TT}. 


(a) What are the sample points and the elementary events of this S? 
(b) What is the event that corresponds to the statement “at least one tail is obtained”? 
(c) What event corresponds to “at most one tail is obtained”? 


Exercise 1.1.2. A coin is tossed three times. Consider the sample space S = {HHH,
HHT, HTH, HTT, THH, THT, TTH, TTT} for this experiment.


(a) Is this S suitable to describe two tosses of a coin instead of the S in Exercise 
1.1.1? Explain! 
(b) What events correspond in this S to the statements 
x = “at least one head is obtained,” 
y = “at least one head is obtained in the first two tosses,” 
z = “exactly one head is obtained”?


Exercise 1.1.3. (a) List four different sample spaces to describe three tosses of a 
coin. 

(b) For each of your sample spaces in part (a) give the event corresponding to the 
statement “at most one tail is obtained,” if possible. 

(c) Is it possible to find an event corresponding to the above statement in every con- 
ceivable sample space for the tossing of three coins? Explain! 


Exercise 1.1.4. Describe three different sample spaces for the drawing of a card from 
a 52-card deck other than the ones mentioned in the text. 


Exercise 1.1.5. In the 52-element sample space for the drawing of a card 


(a) Give the events corresponding to the statements p = “an Ace or a red King is
drawn,” and q = “the card drawn is neither red, nor odd, nor a face card.”³
(b) Give statements corresponding to the events 


U = {AH, KH, QH, JH},


and 
V = {2C, 4C, 6C, 8C, 10C, 2S, 4S, 6S, 8S, 10S}. 


(In each symbol the first letter or number denotes the rank of the card, and the 
last letter its suit.) 


Exercise 1.1.6. Three people are asked on a news show before an election whether 
they prefer candidate A or B, or have no preference. Give two sample spaces for the 
possible answers. 


Exercise 1.1.7. The birth dates of a class of 20 students are recorded. Describe three 
sample spaces for the possible birthday of one of these students chosen at random. 


3 The face cards are J, Q, K.


1.2 Operations with Sets


Before turning to a further examination of the relationships between statements and 
events, let us review the fundamentals of the algebra of sets. 

As mentioned before, a common way of describing a set is by listing its members 
between braces. For example {a, b, c} is the set consisting of the three letters a, b, 
and c. The order in which the members are listed is immaterial, and so is any possible 
repetition in the list. Thus {a, b, c}, {b, c, a} and {a, b, b, c, a} each represent the
same set. Two sets are said to be equal if they have exactly the same members. Thus
{a, b, c} = {a, b, b, c, a}.

Sometimes we just give a name to a set, and refer to it by name. For example, we 
may call the above set A. 

We use the symbol ∈ to denote membership in a set. Thus a ∈ A means that a
is an element of A or a belongs to A. Similarly d ∉ A means that d is not a member
of A.
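
These conventions are mirrored by the built-in set type of Python, which we use here purely as an illustration: neither the order of the listing nor repetitions affect the set, and membership is tested with `in`.

```python
A = {"a", "b", "c"}

# Order and repetition in the listing are immaterial.
same_order = A == {"b", "c", "a"}
same_repeats = A == set(["a", "b", "b", "c", "a"])

# Membership: a is an element of A, and d is not.
a_in_A = "a" in A
d_not_in_A = "d" not in A

print(same_order, same_repeats, a_in_A, d_not_in_A)
```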

Another common method of describing a set is that of using a descriptive
statement, as in the following examples: Say S is the 52-element set that describes
the drawing of a card. Then the set {AS, AH, AD, AC} can also be written as
{x | x ∈ S, x is an Ace} or as {x : x ∈ S, x is an Ace}. We read these
expressions as “the set of x’s such that x belongs to S and x is an Ace.” Also, if the context
is clear, we just write this set as {x is an Ace}.

Similarly, {x | 2 < x < 3} = {x : 2 < x < 3} = {2 < x < 3} each denote the
set of all real numbers strictly between 2 and 3. (This example also shows the real 
necessity of such a notation, since it would be impossible to list the infinitely many 
numbers between 2 and 3.) 

We say that a set A is a subset of a set B if every element of A is also an element
of B, and denote this relation by A ⊂ B. For instance, {a, b} ⊂ {a, b, c}. We may
also read A ⊂ B as “A is contained in B.” Notice that by this definition every set is a
subset of itself, too. Thus {a, b, c} ⊂ {a, b, c}. While this usage may seem strange, it
is just a convention, which one often finds useful in avoiding a discussion of “proper”
subsets and the whole of a set, separately. The notation A ⊂ B can also be turned
around and written as B ⊃ A, and read as “B is a superset of A.”

Given two sets A and B, a new set, called the intersection of A and B, is defined
as the set consisting of all the members common to both A and B, and is denoted by
A ∩ B or by AB. The name “intersection” comes from the case in which A and B are
sets of points in the plane. In Figure 1.1, for instance, A and B are the sets of points
inside the two circles, and AB is the set of points of the region labeled I.

Another example: {a, b, c, d} ∩ {b, c, e} = {b, c}. See Figure 1.2.

For any two sets A and B, another useful set, called the union of A and B, is
defined as the set whose members are all the members of A and B taken together,
and is denoted by A ∪ B. Thus, in Figure 1.1 the regions I, II, and III together make
up A ∪ B.

Also, {a, b, c, d} ∪ {b, c, e} = {a, b, c, d, e}. A diagram can illustrate this relation
too, as shown in Figure 1.2. Here the circles and other regions do not represent sets
of points of the plane, but the sets of letters inscribed into them. Such diagrams are
called Venn diagrams.


Fig. 1.1. [Venn diagram: two overlapping circles A and B inside the sample space S, with the four regions numbered I to IV.]

A third important operation is subtraction of sets: A − B denotes the set of those
points of A that do not belong to B. Thus in Figure 1.1, A − B is region II, and B − A
is region III.

If we subtract a set A from the universal set S, that is, consider S − A, the result is
called the complement of A, and we denote it by Ā. (There is no standard notation for
this operation; some books use ~A, A′ or A^c instead.) In Figure 1.1, Ā consists
of the regions III and IV, and B̄ of II and IV.

Using both intersection and complement, we can represent each of the regions in 
Figure 1.1 in a very nice symmetrical manner as 


I = A ∩ B, II = A ∩ B̄, III = Ā ∩ B, IV = Ā ∩ B̄.


Also, we see that A − B = A ∩ B̄ and B − A = B ∩ Ā.
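
The operations of this section can be tried out on the two sets of letters used in the examples above. The Python sketch below is only an illustration; it takes S = {a, b, c, d, e} as the universal set, which is an assumption here, and uses a hypothetical helper `complement`.

```python
S = set("abcde")          # universal set (assumed for this illustration)
A = set("abcd")
B = set("bce")

def complement(X, universe=S):
    """Complement of X relative to the universal set."""
    return universe - X

intersection = A & B      # {b, c}
union = A | B             # {a, b, c, d, e}
difference = A - B        # {a, d}

# A - B coincides with the intersection of A and the complement of B,
# and B - A with the intersection of B and the complement of A.
check_1 = difference == A & complement(B)
check_2 = (B - A) == B & complement(A)
print(intersection, union, difference, check_1, check_2)
```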

Here we end the list of set-operations but, in order to make these operations
possible for all sets, we need to introduce a new set, the so-called empty set. The role
of this set is similar to that of the number zero in operations with numbers: Instead
of saying that we cannot subtract a number from itself, we say that the result of such
a subtraction is zero. Similarly, if we form A − A for any set A, we say that the result
is the set with no elements, which we call the empty set, and denote by ∅. We obtain
∅ in some other cases too: If A is contained in B, that is, A ⊂ B, then A − B = ∅.
Also, if A and B have no common element, then A ∩ B = ∅. In view of this relation,
∅ is said to be a subset of any set A, that is, we extend the definition of ⊂ to include
∅ ⊂ A, for any A.


Fig. 1.2. [Venn diagram of the sets A = {a, b, c, d} and B = {b, c, e}, with the letters inscribed in the regions.]

Warning: the empty set must not be confused with the number zero. While ∅ is
a set, 0 is a number, and they are conceptually distinct from each other. (The empty
set can also be used to illuminate the mentioned distinction between a one-member
set and its single member: {∅} is a set with one element; and the one element is ∅, a
set with no element.)
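
The same distinctions can be observed in Python, used here only as an illustration. Because an ordinary Python set cannot contain another ordinary set, the sketch uses `frozenset` for the one-member set whose single member is the empty set.

```python
A = {"a", "b"}
B = {"a", "b", "c"}
empty = set()

a_minus_a = (A - A) == empty          # subtracting a set from itself
a_minus_b = (A - B) == empty          # A is contained in B
disjoint = ({"a"} & {"b"}) == empty   # no common element
empty_is_subset = empty <= A          # the empty set is a subset of any set

# The set whose only member is the empty set has one element,
# while the empty set itself has none; neither equals the number 0.
set_of_empty = frozenset({frozenset()})
print(len(empty), len(set_of_empty), empty == 0)
```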


Exercises 


Exercise 1.2.1. Use alternative notations to describe the following sets: 


(a) The set of odd numbers between 0 and 10, 

(b) {2, 4, 6, 8, 10}, 

(c) the set of black face cards in a regular deck, 

(d) {x : −3 < x < 3 and x² = 1, 4, or 9},

(e) the set of all real numbers strictly between −1 and +1.


Exercise 1.2.2. Referring to the Venn diagram in Figure 1.3, identify, by numbers, 
the regions corresponding to 


(a) (A ∪ B) ∩ C,
(b) A ∩ (B ∩ C),
(c) $\overline{A \cap (B \cap C)}$, (The complement of the set in (b).)
(d) (A ∪ B) ∪ C,
(e) A ∩ (B ∩ C),
(f) (A ∩ B) ∩ C,
(g) A − (B ∩ C).


Exercise 1.2.3. List all the subsets of {a, b, c}. (There are eight.) 


Exercise 1.2.4. Referring to Figure 1.1, show, by listing the regions corresponding
to both sides of the equations, that


(a) $\overline{A \cup B} = \overline{A} \cap \overline{B}$, and
(b) $\overline{A \cap B} = \overline{A} \cup \overline{B}$.


(These are called deMorgan’s Laws.)


Exercise 1.2.5. The intersection of several sets A, B, C, ..., Z is defined as the set
of points that belong to each, and is denoted by A ∩ B ∩ C ∩ ··· ∩ Z. Show using
Figure 1.3 that A ∩ B ∩ C = (A ∩ B) ∩ C = A ∩ (B ∩ C) = B ∩ (A ∩ C), and so
the parentheses are superfluous in such expressions.


Exercise 1.2.6. (a) How would you define the union of several sets? 
(b) Show using Figure 1.3 that 


A ∪ B ∪ C = A ∪ (B ∪ C) = (A ∪ B) ∪ C = (A ∪ C) ∪ B.


Exercise 1.2.7. Show using Figure 1.3 that in general 


(a) A ∩ (B ∪ C) ≠ (A ∩ B) ∪ C, but
(b) A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C), and
(c) (A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C).


Exercise 1.2.8. Referring to Figure 1.3, express the following regions by using 
A, B, C and unions, intersections and complements:


(a) {8}, 

(b) {3}, 

(c) {1, 4, 5},
(d) {1, 4, 5, 8},
(e) {2, 6}, 

(f) {2, 6, 7}. 


Exercise 1.2.9. If A ∩ B = ∅, what are A ∩ B and A ∪ B? Illustrate by a Venn
diagram.


Exercise 1.2.10. We have A = B if and only if A ⊂ B and B ⊂ A. Use this
equivalence to prove deMorgan’s laws (see Exercise 1.2.4).


Exercise 1.2.11. Prove that A ⊂ B if and only if A ∪ B = B.


Exercise 1.2.12. Prove that A ⊂ B if and only if A ∩ B = A.


1.3 Relationships between Compound Statements and Events


When dealing with statements, we often consider two or more at a time connected by 
words such as “and” and “or.” This is also true when we want to discuss probabilities. 
For instance, we may want to know the probability that a card drawn is an Ace and 
red, or that it is an Ace or a King. Often we are also interested in the negation of 
a statement, as in “the card drawn is not an Ace.” We want to examine how these 
operations with statements are reflected in the corresponding events. 


Example 1.3.1 (Drawing a Card). Consider the statements p = “the card drawn
is an Ace” and q = “the card drawn is red.” The corresponding sets are P =
{AS, AH, AD, AC} and Q = {2H, 2D, 3H, 3D, ..., AH, AD}. Now the statement
“p and q” can be abbreviated to “the card drawn is an Ace and red” (which is
short for “the card drawn is an Ace and the card drawn is red”). This is obviously true
for exactly those outcomes of the drawing for which p and q are both true, that is,
for those sample points that belong to both P and Q. The set of these sample points
is exactly $P \cap Q$ = {AH, AD}. Thus, the truth-set of “p and q,” that is, the event
corresponding to this compound statement, is $P \cap Q$.

Similarly, “p or q” is true for those outcomes for which p is true or q is
true, that is, for the points of P and of Q put together.⁴ This is by definition the
union of the two sets. Thus the truth-set of “p or q” is $P \cup Q$. In our case “p
or q” = “the card drawn is an Ace or red” has the 28-element truth-set $P \cup Q$ =
{AS, AC, 2H, 2D, 3H, 3D, ..., AH, AD}.

Furthermore, the statement “not p” = “the card drawn is not an Ace” is obviously
true whenever any of the 48 cards other than one of the Aces is drawn. The set
consisting of the 48 outcomes not in P is by definition the complement of P. Thus
the event corresponding to “not p” is $\overline{P}$.


The arguments used in the above example obviously apply to arbitrary state- 
ments, too, not just to these specific ones. Thus we can state the following general 
result: 


Theorem 1.3.1 (Correspondence between Logical Connectives and Set Operations).
If P and Q are the events that correspond to any given statements p and q,
then the events that correspond to “p and q,” “p or q” and “not p” are $P \cap Q$,
$P \cup Q$ and $\overline{P}$, respectively.
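
The correspondence in Theorem 1.3.1 can be checked mechanically for the card-drawing statements of Example 1.3.1. The following Python sketch is only an illustration; the card labels (rank followed by a suit letter) and the variable names are our own choices, not notation from the text.

```python
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["S", "H", "D", "C"]  # spades, hearts, diamonds, clubs

# Sample space: the 52 cards, written as rank + suit, e.g. "AH".
S = {r + s for r, s in product(ranks, suits)}

P = {card for card in S if card[:-1] == "A"}   # truth-set of p = "the card is an Ace"
Q = {card for card in S if card[-1] in "HD"}   # truth-set of q = "the card is red"

p_and_q = P & Q   # "p and q" corresponds to the intersection
p_or_q = P | Q    # "p or q" corresponds to the union
not_p = S - P     # "not p" corresponds to the complement within S

print(sorted(p_and_q), len(p_or_q), len(not_p))
```

Python's set operators `&`, `|` and `-` play the roles of intersection, union and (relative) complement here.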


Some other, less important connectives for statements will be mentioned in the 
next example and in the exercises. 


Example 1.3.2 (Choosing a Letter). Let S = {a, b, c, d, e}, A = {a, b, c, d}, and
B = {b, c, e}. (See Figure 1.2.) Thus S corresponds to our choosing one of these
five letters. Let us name the statements corresponding to A and B, p and q. In other
words, let p = “a, b, c, or d is chosen,” and q = “b, c, or e is chosen.” Then $A - B$ =
{a, d} obviously corresponds to the statement “p but not q” = “a, b, c, or d, but not
b, c, or e is chosen.” (As we know, we can also write $A \cap \overline{B}$ for $A - B$.) Similarly
$B - A$ = {e} corresponds to “q but not p,” and $(A - B) \cup (B - A)$ = {a, d, e}
corresponds to “either p or q (but not both).” (The set $(A - B) \cup (B - A)$ is called
the symmetric difference of A and B, and the “or” used here is called the “exclusive
or.”)


4 In mathematics, we use “or” in the inclusive sense, that is, including tacitly the possibility
“or both.”


Fig. 1.4. Throwing two dice.


Example 1.3.3 (Two Dice). Two dice are thrown, say, a black one, and a white one. 
Let b stand for the number obtained on the black die and w for the number on the 
white die. A convenient diagram for S is shown in Figure 1.4. The possible outcomes 
are pairs of numbers such as (2, 3) or (6, 6). (We write such pairs within parentheses, 
rather than braces, and call them ordered pairs, because, unlike in sets, the order of 
the numbers is significant: the first number stands for the result of the throw of one 
die, say the black one, and the second number for the white die.) The set S can be 
written as S = {(b, w) : b = 1, 2, ..., 6 and w = 1, 2, ..., 6}.

Let p = “b + w = 7,” that is, p = “the sum of the numbers thrown is 7,” and
q = “w ≤ 3.” The corresponding truth sets P = {(b, w) : b + w = 7} and Q =
{(b, w) : w ≤ 3} are shown shaded in Figure 1.4. The event corresponding to “p and
q” = “the sum of the numbers thrown is 7 and the white die shows no more than 3”
is the doubly shaded set $P \cap Q$ = {(4, 3), (5, 2), (6, 1)}. The event corresponding
to “p or q” is represented by the 18 + 3 = 21 shaded squares in Figure 1.4; it is
$P \cup Q$ = {(b, w) : b + w = 7 or w ≤ 3}. The 15 unshaded squares represent the
event $\overline{P} \cap \overline{Q}$, which corresponds to “neither p nor q.”
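
The counts quoted in this example can be confirmed by listing the 36 ordered pairs. This is a minimal sketch under our own naming (`both`, `either`, `neither` are illustrative names, not the book's):

```python
# Sample space for a black die b and a white die w.
S = {(b, w) for b in range(1, 7) for w in range(1, 7)}

P = {(b, w) for (b, w) in S if b + w == 7}   # p: the sum is 7
Q = {(b, w) for (b, w) in S if w <= 3}       # q: the white die shows at most 3

both = P & Q          # "p and q"
either = P | Q        # "p or q"
neither = S - either  # "neither p nor q"

print(sorted(both), len(either), len(neither))
```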


Exercises 


Exercise 1.3.1. Consider the throw of two dice as in Example 1.3.3. Let S, p and q 
be the same as there, and let r = “b is 4 or 5.” Describe and illustrate as in Figure 
1.4 the events corresponding to the statements

(a) r,

(b) q or r,

(c) r but not q,

(d) p and q and r,

(e) q and r, but not p.


Exercise 1.3.2. Let a,b,c be statements with truth-sets A, B and C respectively. 
Consider the following statements: 

$p_1$ = “exactly one of a, b, c occurs,”

$p_2$ = “at least one of a, b, c occurs,”

$p_3$ = “at most one of a, b, c occurs.”

In Figure 1.3 identify the corresponding truth sets $P_1$, $P_2$, $P_3$ by the numbers of
the regions and express them using unions, intersections and complements of A, B, 
and C. 


Exercise 1.3.3. Again, let a, b,c be statements with truth-sets A, B and C respec- 
tively. Consider the following statements: 

$p_4$ = “exactly two of a, b, c occur,”

$p_5$ = “at most two of a, b, c occur,”

$p_6$ = “at least two of a, b, c occur.”

In Figure 1.3 identify the corresponding truth sets $P_4$, $P_5$, $P_6$ by the numbers of
the regions and express them using unions, intersections and complements of A, B, 
and C. 


Exercise 1.3.4. Let a = “an Ace is drawn” and b = “a red card is drawn,” let S be 
our usual 52-point sample space for the drawing of a card and A and B the events 
corresponding to a and b. 


(i) What logical relations correspond to deMorgan’s Laws (Exercise 1.2.4) for these
statements?
(ii) To what statement does S correspond? 


Exercise 1.3.5. Suppose A and B are two subsets of a sample space S such that
$A \cup B = S$. If A and B correspond to some statements a and b, what can you say about the
latter?


Exercise 1.3.6. Again, let A and B be events corresponding to statements a and b.
How are a and b related if $A \cap B = \emptyset$?


2 


Combinatorial Problems 


2.1 The Addition Principle 


As mentioned in the Introduction, if we assume that the elementary events of an
experiment with finitely many possible outcomes are equally likely, then the
assignment of probabilities is quite simple and straightforward.¹ For example, if we want
the probability of drawing an Ace when the experiment consists of the drawing of a
card under the assumption that any card is as likely to be drawn as any other, then we
can say that 1/52 is the probability of drawing any of the 52 cards, and 4/52 = 1/13
is the probability of drawing an Ace, since there are 4 Aces in the deck. We obtain
the probability by taking the number of outcomes making up the event that an Ace
is drawn, and dividing it by the total number of outcomes in the sample space. Thus
the assignment of probabilities is based on the counting of numbers of outcomes, if
these are equally likely. The counting was very simple in the above example, but in
many others it can become quite involved. For example, the probability of drawing
two Aces if we draw two cards at random (this means “with equal probabilities for
all possible outcomes”) from our deck is $(4 \cdot 3)/(52 \cdot 51) \approx 0.0045$, since, as we
shall see in the next section, $4 \cdot 3 = 12$ is the number of ways in which two Aces can
be drawn, and $52 \cdot 51 = 2652$ is the total number of possible outcomes, that is, of
possible pairs of cards.
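
For readers who like to see such counts confirmed by brute force, here is a small Python sketch. It simply lists all ordered pairs of distinct cards and counts those made up of two Aces; the card labels and variable names are our own assumptions.

```python
from fractions import Fraction
from itertools import permutations

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
deck = [r + s for r in ranks for s in "SHDC"]   # 52 labels such as "AS" or "10H"

# All ordered pairs of two different cards (drawing without replacement).
pairs = list(permutations(deck, 2))

# Pairs in which both cards are Aces (the rank is everything but the last character).
favorable = [pair for pair in pairs if all(card[:-1] == "A" for card in pair)]

prob = Fraction(len(favorable), len(pairs))
print(len(favorable), len(pairs), prob, float(prob))
```

The exact value is 12/2652 = 1/221, which rounds to the 0.0045 quoted above.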

Since the counting of cases can become quite complicated, we are going to 
present a systematic discussion of the methods required for the most important count- 
ing problems that occur in the applications of the theory. Such counting problems are 
called combinatorial problems, because we count the numbers of ways in which dif- 
ferent possible outcomes can be combined. 

The first question we ask is: What do our basic set operations do to the numbers
of elements of the sets involved? In other words, if we let n(X) denote the number of
elements of the set X for any X, then how are $n(A)$, $n(B)$, $n(A \cap B)$, $n(A \cup B)$, $n(\overline{A})$,
$n(A - B)$, etc., related to each other?

We can obtain several relations from the following obvious special case: 


1 In this chapter every set will be assumed to be finite.


If $A \cap B = \emptyset$, then $n(A \cup B) = n(A) + n(B)$. (2.1)


We can restate this as: If A and B do not overlap, then the number of elements in 
their union equals the sum of the number of elements of A and of B. Basically this is 
nothing else but the definition of addition: The sum of two natural numbers has been 
defined by putting two piles together. 

When two sets do not overlap, that is, $A \cap B = \emptyset$, then we call them disjoint
or mutually exclusive. Similarly, we call any number of sets disjoint or mutually
exclusive if no two of them have a point in common. For three sets, A, B, C, for
instance, we require that $A \cap B = \emptyset$, $A \cap C = \emptyset$ and $B \cap C = \emptyset$, if we want them
to be disjoint. Notice that it is not enough to require $A \cap B \cap C = \emptyset$. While the
latter follows from the former equations, we do not have it the other way around,
and obviously we need the first three conditions if we want to extend the addition
principle to A, B, and C. By repeated application of the addition principle we can
generalize it to any finite number of sets:


Theorem 2.1.1. If $A_1, A_2, \ldots, A_k$ are k disjoint sets, then
$n(A_1 \cup A_2 \cup \cdots \cup A_k) = n(A_1) + n(A_2) + \cdots + n(A_k)$. (2.2)


We leave the proof as an exercise. 
If the sets involved in a union are not necessarily disjoint, then the addition prin- 
ciple leads to 


Theorem 2.1.2. For any two sets A and B,
$n(A \cup B) = n(A) + n(B) - n(A \cap B)$. (2.3)


Proof. We have $A = (A \cap \overline{B}) \cup (A \cap B)$ and $B = (\overline{A} \cap B) \cup (A \cap B)$, with $A \cap \overline{B}$, $A \cap B$,
and $\overline{A} \cap B$ disjoint (see Figure 2.1).
Thus


$n(A) = n(A \cap \overline{B}) + n(A \cap B)$ and $n(B) = n(\overline{A} \cap B) + n(A \cap B)$. (2.4)


Fig. 2.1.


Adding, we get
$n(A) + n(B) = n(A \cap \overline{B}) + n(A \cap B) + n(\overline{A} \cap B) + n(A \cap B)$. (2.5)


On the other hand, $A \cup B$ is the union of the disjoint sets $A \cap \overline{B}$, $A \cap B$, and
$\overline{A} \cap B$. So, by Theorem 2.1.1,


$n(A \cup B) = n(A \cap \overline{B}) + n(A \cap B) + n(\overline{A} \cap B)$. (2.6)


The right-hand side of this equation is the same as the sum of the first three terms
on the right of Equation 2.5. Thus


$n(A) + n(B) = n(A \cup B) + n(A \cap B)$. (2.7)
Rearranging the terms results in the formula of the theorem. □


Example 2.1.1 (Survey of Drinkers and Smokers). In a survey, 100 people are asked 
whether they drink or smoke or do both or neither. The results are: 60 drink, 30 
smoke, 20 do both, and 30 do neither. Are these numbers compatible with each other? 

If we let A denote the set of drinkers, B the set of smokers, N the set of those
who do neither, and S the set of all those surveyed, then the data translate to $n(A) = 60$,
$n(B) = 30$, $n(A \cap B) = 20$, $n(N) = 30$, $n(S) = 100$. Also, $A \cup B \cup N = S$,
and $A \cup B$ and N are disjoint. So we must have $n(A \cup B) + n(N) = n(S)$, that is,
$n(A \cup B) + 30 = 100$. By Theorem 2.1.2, $n(A \cup B) = n(A) + n(B) - n(A \cap B)$.
Therefore in our case $n(A \cup B) = 60 + 30 - 20 = 70$ and $n(A \cup B) + 30 = 70 + 30$
is indeed 100, which shows that the data are compatible.
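
The compatibility check of this example is easy to express as a short computation. The sketch below is only one way to write it; the function names `union_size` and `survey_is_consistent` are hypothetical helpers of our own.

```python
def union_size(n_a, n_b, n_both):
    """Theorem 2.1.2: n(A u B) = n(A) + n(B) - n(A n B)."""
    return n_a + n_b - n_both


def survey_is_consistent(total, n_a, n_b, n_both, n_neither):
    """The union and the 'neither' group are disjoint and together make up S."""
    return union_size(n_a, n_b, n_both) + n_neither == total


print(union_size(60, 30, 20))                        # drinkers or smokers
print(survey_is_consistent(100, 60, 30, 20, 30))     # the data of the example
```

Changing any one of the five numbers while keeping the others fixed makes the check fail, which is what "incompatible data" means here.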


Let us mention that we could have argued less formally that Theorem 2.1.2 must
be true because, if we form $n(A) + n(B)$, we count all the points of $A \cup B$, but those
in $A \cap B$ are then counted twice (once as part of n(A) and once as part of n(B)). So,
in forming $n(A) + n(B) - n(A \cap B)$, the subtraction undoes the double-counting and
each point in $A \cup B$ is counted exactly once.

Theorem 2.1.2 can be generalized to unions of three or more sets. For example, 


$n(A \cup B \cup C) = n(A) + n(B) + n(C) - n(A \cap B) - n(A \cap C) - n(B \cap C) + n(A \cap B \cap C)$. (2.8)


We leave the proof of this equation as an exercise. This result and the analogous 
formulas for more sets are much less important in applications than the case of two 
sets given in Theorem 2.1.2, and we shall not discuss them further. 
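
Although we shall not pursue the three-set formula, it can be checked numerically on a concrete case. In the following sketch the three sets (multiples of 2, 3 and 5 among the numbers 1 to 30) are our own illustrative choice, not an example from the text.

```python
universe = range(1, 31)

A = {x for x in universe if x % 2 == 0}   # multiples of 2
B = {x for x in universe if x % 3 == 0}   # multiples of 3
C = {x for x in universe if x % 5 == 0}   # multiples of 5

# Left-hand side of Equation 2.8: count the union directly.
direct = len(A | B | C)

# Right-hand side of Equation 2.8.
by_formula = (len(A) + len(B) + len(C)
              - len(A & B) - len(A & C) - len(B & C)
              + len(A & B & C))

print(direct, by_formula)
```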

From the addition principle, it is easy to see that in general 


$n(B - A) = n(B) - n(A \cap B)$ (2.9)
and
$n(B - A) = n(B) - n(A)$ if and only if $A \subset B$. (2.10)


(This relation is sometimes called the subtraction principle.) Substituting S for B,
we get


$n(\overline{A}) = n(S) - n(A)$. (2.11)


Exercises


Exercise 2.1.1. If in a survey of 100 people, 65 people drink, 28 smoke, and 30 do 
neither, then how many do both? 


Exercise 2.1.2. Give an example of three pairwise nondisjoint sets A, B and C such
that $A \cap B \cap C = \emptyset$.


Exercise 2.1.3. Prove that any one of the conditions $A \cap B = \emptyset$, $A \cap C = \emptyset$, or
$B \cap C = \emptyset$ implies $A \cap B \cap C = \emptyset$.


Exercise 2.1.4. Prove Theorem 2.1.1 


(a) for k = 3, 
(b) for arbitrary k. 


Exercise 2.1.5. Prove the formula given in Equation 2.8 for n(A U B UC) by using 
the Venn diagram of Figure 1.3 on page 9. 


Exercise 2.1.6. How many cards are there in a deck of 52 that are 


(a) Aces or spades, 
(b) neither Aces nor spades, 
(c) neither Aces nor spades nor face cards (J, Q, K)? 


2.2 Tree Diagrams and the Multiplication Principle 


In the previous section we worked with fixed sample spaces and counted the num- 
ber of points in single events. Here we are going to consider the construction of 
new sample spaces and events from previously given ones, and count the number 
of possibilities in the new sets. For example, we throw a die three times, and want 
to relate the number of elements of a sample space for this experiment to the three 
six-element sample spaces for the individual throws. Or we draw two cards from a 
deck, and want to find the number of ways in which the two drawings both result in 
Aces, by reasoning from the separate counts in the two drawings. 

The best way to approach such multistep problems is by drawing a so-called tree
diagram. In such diagrams we first list the possible outcomes of the first step, and 
then draw lines from each of those to the elements in a list of the possible outcomes 
that can occur in the second step depending on the outcome in the first step. We 
continue likewise for the subsequent steps, if any. 

The above description may be unclear at this point; let us clarify it by some 
examples. 


Example 2.2.1 (Drawing Two Aces). Let us illustrate the possible ways of succes- 
sively drawing two Aces from a deck of cards (we do not replace the first one before 
drawing the second). In the first step, we can obtain AS, AH, AD, AC, but in the 
second step we can only draw an Ace that has not been drawn before. This is shown 
in Figure 2.2.


              Start

First step:   AS          AH          AD          AC

Second step:  AH AD AC    AS AD AC    AS AH AC    AS AH AD


Fig. 2.2.


As we see, for each choice in the first step, there are three possible choices in the 
second step; thus altogether there are $4 \cdot 3 = 12$ choices for two Aces. In the figure,
for the sake of completeness, we included a harmless extra point on the top, labeled 
“Start,” so that the four choices in the first step do not hang loose. We could turn the 
diagram upside down (or sideways, too), and then it would resemble a tree: this is 
the reason for the name. The number 12 shows up two ways in the diagram: first, it 
is the number of branches from the Start to the bottom, and second, it is the number 
of branch tips, that is, entries in the bottom row, whether they are distinct or not. 
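
The tree of this example can also be built as a small data structure. The sketch below is one possible representation, assuming a dictionary that maps each first-step Ace to the list of Aces still available in the second step; the names `tree` and `paths` are ours.

```python
aces = ["AS", "AH", "AD", "AC"]

# For each first draw, the second draw can be any Ace not yet drawn.
tree = {first: [second for second in aces if second != first] for first in aces}

# Each path from Start to a branch tip is an ordered pair (first, second).
paths = [(first, second) for first, seconds in tree.items() for second in seconds]

for first, seconds in tree.items():
    print(first, "->", " ".join(seconds))
print(len(paths))
```

The number of paths equals the number of branch tips, as observed above.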


Example 2.2.2 (Primary Elections). Before primary elections, voters are polled about
their preferences in a certain state. There are two Republican candidates $R_1$ and $R_2$,
and three Democratic candidates $D_1$, $D_2$, $D_3$. The voters are first asked whether they
are registered Republicans (R), Democrats (D) or Independents (I), and second,
which candidate they prefer. The Independents are allowed to vote in either primary,
so in effect they can choose any of the five candidates. The possible responses are
shown in the tree of Figure 2.3.


Notice that the total number of branches in the second step is 10, which can be
obtained by using the addition principle: we add the three branches through D, the two
through R, and the five through I. The branches correspond to mutually exclusive
events in the 10-element compound sample space {$DD_1$, $DD_2$, $DD_3$, $RR_1$, $RR_2$, $ID_1$,
$ID_2$, $ID_3$, $IR_1$, $IR_2$}. This is the new sample space built up in a complicated manner
from the simpler ones {D, R, I}, {$D_1$, $D_2$, $D_3$} and {$R_1$, $R_2$}.
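
The construction of this compound sample space can be imitated in a few lines of Python. This is only a sketch; the plain-text labels such as "D1" and the dictionary names are our own stand-ins for the subscripted symbols.

```python
candidates = {"D": ["D1", "D2", "D3"], "R": ["R1", "R2"]}

# Second-step choices available after each first-step answer.
choices = {
    "D": candidates["D"],
    "R": candidates["R"],
    "I": candidates["D"] + candidates["R"],   # Independents may pick any of the five
}

sample_space = [reg + cand for reg, cands in choices.items() for cand in cands]

print(sample_space)
print(sum(len(cands) for cands in choices.values()))   # addition principle: 3 + 2 + 5
```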


Example 2.2.3 (Tennis Match). In a tennis match two players, A and B, play several 
sets until one of them wins three sets. (The rules allow no ties.) The possible sequences
of winners are shown in Figure 2.4.


The circled letters indicate the ends of the 20 possible sequences. As can be seen,
the branches have different lengths, and this makes the counting more difficult than in
the previous examples. Here, by repeated use of the sample space {A, B}, we built up
the 20-element sample space {AAA, AABA, AABBA, AABBB, ABAA, ABABA,
ABABB, ..., BBB}.

Notice that if we look upon these strings of A’s and B’s as words, then they
are arranged in alphabetical order (e.g., AAA before AABA). Arranging selections
in alphabetical or numerical order is often very helpful in making counts accurate,
since it helps 1) to avoid unwanted repetitions, and 2) to ensure that everything is
listed.
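
A tree with branches of different lengths is naturally described by a recursive procedure: a branch stops as soon as one player has three wins, and otherwise it splits in two. The following Python sketch lists the sequences this way; the function name `match_sequences` and its parameters are our own.

```python
def match_sequences(prefix="", wins_needed=3):
    """Return all sequences of set winners, in the order a tree would list them."""
    # A branch ends as soon as either player has won the required number of sets.
    if prefix.count("A") == wins_needed or prefix.count("B") == wins_needed:
        return [prefix]
    # Otherwise the next set is won by A or by B; taking A first keeps the list alphabetical.
    return (match_sequences(prefix + "A", wins_needed)
            + match_sequences(prefix + "B", wins_needed))


sequences = match_sequences()
print(len(sequences))
print(sequences[:7])
```

Because the branch for A is always explored before the branch for B, the list comes out in the alphabetical order recommended above.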


We discussed in Example 2.2.2 how the addition principle was applicable there. 
Now, it is easy to see that it is applicable in Example 2.2.1 and Example 2.2.3 as 
well. The latter was intended to illustrate branches of various lengths, and we cannot 
extract any important regularity from it. In Example 2.2.1, however, we see the op- 
eration of multiplication showing up for the first time. The four choices in the first 
step fan out into three branches each, and so, by the addition principle, we obtain the 
total number of branches for the second step if we add 3 to itself four times. This 
operation, however, is the same as multiplication of 3 by 4. In general, since
multiplication by a natural number is repeated addition, if we have $n_1$ choices in the first
step of an experiment, and each of those gives rise to $n_2$ choices in the second step,
then the number of possible outcomes for both steps together, that is, the number of
paths from top to bottom of the corresponding two-step tree, is $n_1 n_2$.

We can easily generalize this statement to experiments with several steps, and 
call it a new principle:


The Multiplication Principle. If an experiment is performed in m steps, and there
are $n_1$ choices in the first step, and for each of those there are $n_2$ choices in the
second step, and so on, with $n_m$ choices in the last step for each of the previous
choices, then the number of possible outcomes, for all the steps together, is given by
the product $n_1 n_2 n_3 \cdots n_m$.


Example 2.2.4 (Three Coin Tosses). Toss a coin three times. Then the number of
steps is m = 3, and in each step we have two possibilities, H or T, hence
$n_1 = n_2 = n_3 = 2$. Thus the total number of possible outcomes, that is, of different triples
of H’s and T’s, is $2 \cdot 2 \cdot 2 = 2^3 = 8$. Similarly, in m tosses we have $2^m$ possible
sequences of H’s and T’s.


Example 2.2.5 (Number of Subsets). The number of subsets of a set of m elements is
$2^m$. This can be seen by considering any subset as being built up in m steps: We take
in turn each of the m elements of the given set, and decide whether it belongs to the
desired subset or not. Thus we have m steps, and in each step two choices, namely
yes or no to the question of whether the element belongs to the desired subset. The
$2^m$ subsets include $\emptyset$ and the whole set. (Why?)
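
The yes-or-no construction of subsets translates directly into code. The sketch below, with a helper name `subsets` of our own choosing, runs through all sequences of m decisions and collects the elements for which the decision was "yes."

```python
from itertools import product


def subsets(elements):
    """Build every subset by making a yes/no decision for each element in turn."""
    result = []
    for decisions in product([False, True], repeat=len(elements)):
        result.append({x for x, keep in zip(elements, decisions) if keep})
    return result


all_subsets = subsets(["a", "b", "c"])
print(len(all_subsets))
print(all_subsets)
```

The all-"no" sequence produces the empty set and the all-"yes" sequence produces the whole set.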


Example 2.2.6 (Drawing Three Cards). The number of ways three cards can be
drawn one after the other from a regular deck is $52^3$ if we replace each card
before the next one is drawn, and $52 \cdot 51 \cdot 50$ if we do not replace them. This is
so because we have three steps in both cases, i.e., m = 3; with replacement we
can pick any of the 52 cards in each step, that is, $n_1 = n_2 = n_3 = 52$; and without
replacement we can pick any of the $n_1 = 52$ cards in the first step, but for the
second step only $n_2 = 51$ cards remain to be drawn from, and for the third step only
$n_3 = 50$.
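
In computational terms the multiplication principle is just a product over the numbers of choices in the successive steps. A minimal sketch, in which `count_outcomes` is a hypothetical helper name of our own:

```python
from math import prod


def count_outcomes(choices_per_step):
    """Multiply the numbers of choices n_1, n_2, ..., n_m of an m-step experiment."""
    return prod(choices_per_step)


with_replacement = count_outcomes([52, 52, 52])     # three cards, each replaced
without_replacement = count_outcomes([52, 51, 50])  # three cards, none replaced
coin_triples = count_outcomes([2, 2, 2])            # three coin tosses

print(with_replacement, without_replacement, coin_triples)
```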


Example 2.2.7 (Seating People). There are four seats and three people in a car, but 
only two can drive. In how many ways can they be seated if one is to drive? 

For the driver’s seat we have 2 choices, and for the next seat 3, because either of 
the remaining two people can sit there or it can remain empty. For the third seat we 
have two possibilities in each case: if the second seat was left empty, then either of the 
remaining two people can be placed there, and if the second seat was occupied, then 
the third one can either be occupied by the remaining person, or be left empty. The 
use of the fourth seat is uniquely determined by the use of the others. Consequently,
the solution is $2 \cdot 3 \cdot 2 \cdot 1 = 12$.

Alternatively, once the driver has been selected in 2 possible ways, the second
person can take any one of 3 seats and the third person one of the remaining 2 seats.
Naturally, we get the same result: $2 \cdot 3 \cdot 2 = 12$.

Notice that in this problem we had to start our counting with the driver, but then
had a choice whether to assign people to seats or seats to people. Such considerations 
are typical in counting problems, and often the nature of the problem favors one 
choice over another. 
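
Since the two ways of counting must agree, it is reassuring to confirm the answer by listing every seating. In the sketch below the people's labels are invented for illustration, seat 0 stands for the driver's seat, and `None` marks the seat left empty.

```python
from itertools import permutations

people = ["driver1", "driver2", "passenger"]   # hypothetical labels
can_drive = {"driver1", "driver2"}

# Four seats, three people: exactly one seat stays empty (None).
occupants = people + [None]

# A seating assigns one occupant to each of the seats 0, 1, 2, 3;
# seat 0 is the driver's seat and must hold someone who can drive.
seatings = [s for s in permutations(occupants) if s[0] in can_drive]

print(len(seatings))
```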


Example 2.2.8 (Counting Numbers with Odd Digits). How many natural numbers are
there under 1000 whose digits are all odd?

Since all such numbers have either one, two, or three digits, we count those cases
separately, and add up the three results. First, there are 5 single-digit odd numbers.
Second, there are $5^2$ numbers with two odd digits, since each of the two digits can
be chosen five ways. Third, we can form $5^3$ three-digit numbers with odd digits only.
Thus the solution is $5 + 5^2 + 5^3 = 155$.
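
A direct count over the numbers 1 to 999 gives the same result. This short sketch (the variable names are ours) compares the brute-force count with the value obtained from the addition and multiplication principles:

```python
# Count the numbers below 1000 all of whose decimal digits are odd.
brute_force = sum(1 for n in range(1, 1000) if all(d in "13579" for d in str(n)))

# One-, two- and three-digit cases counted separately and added.
by_principles = 5 + 5**2 + 5**3

print(brute_force, by_principles)
```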


Exercises 


Exercise 2.2.1. (a) What sample space does Figure 2.2 illustrate? 

(b) What are the four mutually exclusive events in this sample space that correspond 
to the drawing of AS, AH, AD, AC, respectively, in the first step? 

(c) What is the event corresponding to the statement “one of the two cards drawn is
AH”?


Exercise 2.2.2. In a survey, voters are classified according to sex (M or F), party
affiliation (D, R, or I), and educational level (say A, B, or C). Illustrate the possible
classifications by a tree diagram. How many are there?


Exercise 2.2.3. In an urn there are two black and four white balls. (It is traditional 
to call the containers urns in such problems.) Two players alternate drawing a ball 
until one of them has two white ones. Draw a tree to show the possible sequences of 
drawings. 


Exercise 2.2.4. In a restaurant, a complete dinner is offered for a fixed price in which 
a choice of one of three appetizers, one of three entrees, and one of two desserts is 
given. Draw a tree for the possible complete dinners. How many are there? 


Exercise 2.2.5. Three different prizes are simultaneously given to students from a 
class of 30 students. In how many ways can the prizes be awarded 


(a) if no student can receive more than one prize, 
(b) if more than one prize can go to a student? 


Exercise 2.2.6. How many positive integers are there under 5000 that 


(a) are odd, 

(b) end in 3 or 4, 

(c) consist of only 3’s and/or 4’s, 
(d) do not contain 3’s or 4’s? 


(Hint: In some of these cases it is best to write these numbers with four digits, for 
instance, 15 as 0015, to choose the four digits separately and use the multiplication 
and addition principles.) 


Exercise 2.2.7. In the Morse code, characters are represented by code words made 
up of dashes and dots. 


(a) How many characters can be represented with three or fewer dashes and/or dots?
(b) With four or fewer?


Exercise 2.2.8. A car has six seats including the driver’s, which must be occupied 
by a driver. In how many ways is it possible to seat 


(a) six people if only two can drive, 
(b) five people if only two can drive, 
(c) four people if each can drive? 


2.3 Permutations and Combinations 


Certain counting problems recur so frequently in applications that we have special 
names and symbols associated with them. These will now be discussed. 

Any arrangement of things in a row is called a permutation of those things. We
denote the number of permutations of r different things out of n different ones by
${}_nP_r$. This number can be obtained by the multiplication principle. For example,
${}_8P_3 = 8 \cdot 7 \cdot 6 = 336$, because we have r = 3 places to fill in a row, out of n = 8 objects.
The first place can be filled 8 ways; the second place 7 ways, since one object has
been used up; and for the third place 6 objects remain. Because all these selections
are performed one after the other, ${}_8P_3$ is the product of the three numbers 8, 7, and
6.

In general, ${}_nP_r$ can be obtained by counting backwards r numbers starting with
n, and multiplying these r factors together. If we want to write a formula for ${}_nP_r$
(which we need not use, we may just follow the above procedure instead), we must
give some thought to what the expression for the last factor will be: In place 1 we can
put n objects, which we can write as $n - 1 + 1$; in place 2 we can put $n - 1 = n - 2 + 1$
objects; and so on. Thus the rth factor will be $n - r + 1$, and so, for any² positive
integers n and $r \le n$,


${}_nP_r = n(n-1)(n-2)\cdots(n-r+1)$. (2.12)


We can check that for our example, in which n = 8 and r = 3, we obtain
$n - r + 1 = 8 - 3 + 1 = 6$, which was indeed the last factor in ${}_8P_3$.

For the product that gives ${}_nP_n$ we have a special name and a symbol. We call it
n-factorial, and write it as n!. Thus, for any positive integer n,


$n! = n(n-1)(n-2)\cdots 3 \cdot 2 \cdot 1$. (2.13)


2 Note that the product on the right-hand side of Equation 2.12 does not have to be taken
literally as containing at least four factors. This expression is the usual way of indicating
that the factors should start with n and go down in steps of 1 to $n - r + 1$. For instance, if
r = 1, then $n - r + 1 = n$, and the product should start and end with n, that is, ${}_nP_1 = n$.
The obvious analog of this convention is generally used for any sums or products in which
a pattern is indicated, for example in Equation 2.13 as well.


The symbol n! is just a convenient abbreviation for the above product, that is, for
the product of all natural numbers from 1 to n (the order does not really matter). For
example, $1! = 1$, $2! = 2 \cdot 1 = 2$, $3! = 3 \cdot 2 \cdot 1 = 6$, $4! = 4 \cdot 3 \cdot 2 \cdot 1 = 24$.

As we have said, the number of permutations of n things out of n is ${}_nP_n = n!$.

From the definitions of n!, (n − r)! and ${}_nP_r$ we can obtain the following relation:
$n! = [n(n-1)(n-2)\cdots(n-r+1)][(n-r)(n-r-1)\cdots 2 \cdot 1] = {}_nP_r \cdot (n-r)!$,
and so


${}_nP_r = \frac{n!}{(n-r)!}$. (2.14)
Formulas 2.12 and 2.13 defined ${}_nP_r$ and n! for all positive integer values of n
and $r \le n$. The above formula, however, becomes meaningless for r = n, since then
$n - r = 0$, and we have not defined 0!. To preserve the validity of this formula for
the case of r = n, we define 0! = 1. Then, for r = n, Formula 2.14 becomes


${}_nP_n = \frac{n!}{0!} = n!$, (2.15)
as it should. We shall see later that, by this definition, many other formulas also
become meaningful whenever 0! appears. We can also extend the definition of ${}_nP_r$
to the case of r = 0, by setting
${}_nP_0 = 1$, (2.16)


as required by Equation 2.14, and we can further extend the definition to n = 0, by
defining ${}_0P_0 = 1$ as well.
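
The two descriptions of the number of permutations, the product of r factors counted backwards from n (Equation 2.12) and the quotient of factorials (Equation 2.14), can be compared numerically. In this sketch `n_p_r` is our own helper; `math.factorial` and `math.perm` are standard-library functions (the latter is available from Python 3.8 on).

```python
from math import factorial, perm


def n_p_r(n, r):
    """Multiply the r factors n, n-1, ..., n-r+1; an empty product (r = 0) gives 1."""
    result = 1
    for factor in range(n, n - r, -1):
        result *= factor
    return result


print(n_p_r(8, 3))                                 # the example 8 * 7 * 6
print(n_p_r(8, 3) == factorial(8) // factorial(5)) # agrees with Equation 2.14
print(n_p_r(5, 5), n_p_r(5, 0), n_p_r(0, 0))       # the boundary cases r = n, r = 0, n = 0
```

Note that the empty product for r = 0 automatically yields the value 1 adopted in Equation 2.16.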


Example 2.3.1 (Dealing Three Cards). In how many ways can three cards be dealt 
from a regular deck of 52 cards? 

The answer is ${}_{52}P_3 = 52 \cdot 51 \cdot 50$ = 132,600. Notice that in this answer, the order
in which the cards are dealt is taken into consideration, not only the result of the deal.
Thus a deal of AS, AH, KH is counted as a case different from AH, KH, AS.


In many problems, as in the above example, it is unnatural to concern ourselves
with the order in which things are selected, and we want to count only the number
of different possible selections without regard to order. The number of possible
unordered selections of r different things out of n different ones is denoted by ${}_nC_r$, and
each such selection is called a combination of the given things.

To obtain a formula for ${}_nC_r$ we can argue the following way. If we select r things
out of n without regard to order, then, as we have just said, this can be done in ${}_nC_r$
ways. In each case we have r things which can be ordered r! ways. Thus, by the
multiplication principle, the number of ordered selections is ${}_nC_r \cdot r!$. On the other
hand, this number is, by definition, ${}_nP_r$. Therefore ${}_nC_r \cdot r! = {}_nP_r$, and so


${}_nC_r = \frac{{}_nP_r}{r!} = \frac{n!}{r!(n-r)!}$. (2.17)


The quantity on the right-hand side is usually abbreviated as $\binom{n}{r}$, and is called a
binomial coefficient, for reasons that will be explained in the next section. We have,
for example,


$\binom{3}{2} = \frac{3!}{2!(3-2)!} = \frac{6}{2 \cdot 1} = 3$, and $\binom{7}{3} = \frac{7!}{3!4!} = \frac{7 \cdot 6 \cdot 5}{3 \cdot 2 \cdot 1} = 35$.


In the latter example the 4! could be cancelled, and we could similarly cancel
(n − r)! in the general formula, as we did for ${}_nP_r$. Thus, for any positive integer n
and r = 1, 2, ..., n,


${}_nC_r = \binom{n}{r} = \frac{n(n-1)(n-2)\cdots(n-r+1)}{r!}$. (2.18)
For r = 0 the cancellation, together with 0! = 1, gives
${}_nC_0 = \binom{n}{0} = \frac{1}{0!} = 1$, (2.19)
and we can extend the validity of this formula to n = 0 as well.
The formula
$\binom{n}{r} = \frac{n!}{r!(n-r)!}$ (2.20)


remains unchanged if we replace r by n − r, and so


$\binom{n}{n-r} = \binom{n}{r}$. (2.21)


This formula says that the number of combinations of n —r things out of n equals 
the number of combinations of r things out of n. We can easily see that this must be 
true, since whenever we make a particular selection of n — r things out of n, we are 
also selecting the r things that remain unselected, that is, we are splitting the n things 
into two sets of n — r andr things simultaneously. 
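
The formulas for binomial coefficients are easy to test numerically. The following sketch uses a helper `n_c_r` of our own, written after Equation 2.18, and compares it with the standard-library function `math.comb` (available from Python 3.8 on); it also checks the symmetry of Equation 2.21.

```python
from math import comb, factorial


def n_c_r(n, r):
    """Equation 2.18: r factors counted down from n, divided by r!."""
    numerator = 1
    for factor in range(n, n - r, -1):
        numerator *= factor
    return numerator // factorial(r)


print(n_c_r(3, 2), n_c_r(7, 3))   # the two worked examples
print(n_c_r(7, 0), n_c_r(0, 0))   # the cases r = 0 and n = 0
print(all(n_c_r(n, n - r) == n_c_r(n, r)
          for n in range(0, 10) for r in range(0, n + 1)))   # symmetry
```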


Example 2.3.2 (Selecting Letters). Let us illustrate the relationship between permu- 
tations and combinations, that is, between ordered and unordered selections, by a 
simple example, in which all cases can easily be enumerated. Say we have four let- 
ters A, B,C, D, and want to select two. If order counts, then the possible selections 
are 

AB, AC, AD, BC, BD, CD, 

BA,CA, DA,CB, DB, DC. 

Their number is ${}_4P_2 = 4 \cdot 3 = 12$. If we want to disregard the order in which the
letters are selected, then AB and BA stand for the same combination, also AC and
CA for another single combination, and so on. Thus the number of selections written
in the first row above, that is, 6, gives us ${}_4C_2$. Indeed, $\binom{4}{2} = (4 \cdot 3)/(2 \cdot 1) = 6$. In this
case, the argument we used for obtaining ${}_nC_r$ amounts to saying that each unordered
selection gives rise to two ordered selections, and there are 12 of the latter, hence
$2 \cdot {}_4C_2 = 12$, and so ${}_4C_2 = 12/2 = 6$.

We can also look at this slightly differently: We have 12 permutations. To make 
them into combinations we must identify pairs such as AB and BA with each other. 
Thus, the number of combinations is the number of unordered pairs into which a set 
of 12 objects can be partitioned, and this is, by the definition of division, 12/2. 


The argument above can be generalized as follows. 


Division Principle. /f we have m things and k is a divisor? of m, then we can divide 
the set of m elements into m/k subsets of k elements each. 


Applied to permutations and combinations, this principle says that the $m = {}_nP_r$
permutations can be grouped into subsets with $k = r!$ elements, with those permutations
that have the same letters making up each subset, and the number of these subsets
is ${}_nP_r/r!$. Since these subsets represent all the combinations, their number is, on
the other hand, ${}_nC_r$. Thus, the division principle can directly give us the previously
obtained relationship ${}_nC_r = {}_nP_r/r!$.
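
As a check on this use of the division principle, the sketch below groups the r-permutations of a small set by the letters they contain and confirms that every group has r! members. It assumes Python 3.8 or later for `math.perm` and `math.comb`; the helper name `group_permutations` is hypothetical.

```python
from itertools import permutations
from math import comb, factorial, perm

def group_permutations(items, r):
    """Group the r-permutations of items by the set of elements they use."""
    groups = {}
    for p in permutations(items, r):
        groups.setdefault(frozenset(p), []).append(p)
    return groups

groups = group_permutations("ABCDE", 3)

# Every group has r! = 6 members, so the number of groups is 5P3 / 3!.
assert all(len(g) == factorial(3) for g in groups.values())
assert len(groups) == perm(5, 3) // factorial(3) == comb(5, 3)
```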


Example 2.3.3 (Three Card Hands). The number of different three-card hands from 
a deck of 52 cards is 

\[ {}_{52}C_3 = \binom{52}{3} = \frac{52 \cdot 51 \cdot 50}{3 \cdot 2 \cdot 1} = 22{,}100. \]


Example 2.3.4 (Committee Selection). In a class there are 30 men and 20 women. In 
how many ways can a committee of 2 men and 2 women be chosen? 

We have to choose 2 men out of 30, and 2 women out of 20. These choices 
can be done in $\binom{30}{2}$ and $\binom{20}{2}$ ways, respectively. By the multiplication principle, the 
whole committee can be selected in
$\binom{30}{2} \cdot \binom{20}{2} = (30 \cdot 29)/(2 \cdot 1) \cdot (20 \cdot 19)/(2 \cdot 1) = 15 \cdot 29 \cdot 10 \cdot 19 = 82{,}650$ ways. 


Exercises 


Exercise 2.3.1. Evaluate ${}_5P_2$, ${}_6P_3$, ${}_3P_1$, ${}_5P_0$, ${}_6P_0$. 


Exercise 2.3.2. How many three-letter “words” can be formed, without repetition of 
any letter, from the letters of the word “symbol”? (We call any permutation of letters 
a word.) 


Exercise 2.3.3. Prove that $n! = n \cdot (n-1)!$. 


Exercise 2.3.4. Evaluate ${}_5C_2$, ${}_6C_3$, ${}_3C_1$, ${}_5C_0$, ${}_6C_0$. 


³ This means that m/k is a whole number. 


Exercise 2.3.5. List all permutations of 3 letters taken at a time from the letters 
A, B, C, D. Mark the groups whose members must be identified to obtain the combinations
of three letters out of the given four; and explain how the division principle 
gives the number of combinations in this case. 


Exercise 2.3.6. In how many ways can a committee of 4 be formed from 10 men and 
12 women if it is to have 


(a) 2 men and 2 women, 

(b) 1 man and 3 women, 

(c) 4 men, 

(d) 4 people regardless of sex? 


Exercise 2.3.7. A salesman must visit any four of the cities A, B, C, D, E, F, starting
and ending in his home city, which is other than these six. In how many ways can 
he schedule his trip? 


Exercise 2.3.8. A die is thrown until a 6 comes up, but only five times if no 6 comes 
up in 5 throws. How many possible sequences of numbers can come up? 


Exercise 2.3.9. In how many ways can 5 people be seated on 5 chairs around a round 
table if 


(a) only their positions relative to each other count (that is, the arrangements ob- 
tained from each other by rotation of all people are considered to be the same), 
and, 

(b) only who sits next to whom counts, but not on which side (rotations and reflec- 
tions do not change the arrangement)? 


Exercise 2.3.10. Answer the same questions as in Exercise 2.3.9, but for 5 people 
and 7 chairs. 


Exercise 2.3.11. How many positive integers are there under 5000 that are 


(a) multiples of 3, 

(b) multiples of 4, 

(c) multiples of both 3 and 4, 

(d) not multiples of either 3 or 4? 


(Hint: Use the division principle adjusted for divisions with remainder!) 


2.4 Some Properties of Binomial Coefficients and the Binomial 
Theorem 


The binomial coefficients have many interesting properties, and some of these will 
be useful to us later, so we describe them now. 

If we write the binomial coefficients in a triangular array, so that $\binom{0}{0}$ goes into
the first row, $\binom{1}{0}$ and $\binom{1}{1}$ into the second row, $\binom{2}{0}$,
$\binom{2}{1}$ and $\binom{2}{2}$ into the third row, and so on, then we obtain the
following table, called Pascal’s triangle: 


                    1
                  1   1
                1   2   1
              1   3   3   1
            1   4   6   4   1
          1   5  10  10   5   1


It is easy to see that each entry other than 1 is the sum of the two nearest entries 
in the row immediately above it; for example the 6 in the fifth row is the sum of the 
two threes in the fourth row. In general, we have the following theorem. 


Theorem 2.4.1 (Sums of Adjacent Binomial Coefficients). For any positive integers
r and n > r,

\[ \binom{n-1}{r-1} + \binom{n-1}{r} = \binom{n}{r}. \tag{2.22} \]


Proof. We give two proofs. To prove this formula algebraically, we only have to 
substitute the expressions for the binomial coefficients, and simplify. For r = 1 the 
left-hand side becomes 


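\[ \binom{n-1}{0} + \binom{n-1}{1} = 1 + (n-1) = n = \binom{n}{1}, \tag{2.23} \]

and for r > 1 

\begin{align*}
\binom{n-1}{r-1} + \binom{n-1}{r}
&= \frac{(n-1)(n-2)\cdots(n-r+1)}{(r-1)!} + \frac{(n-1)(n-2)\cdots(n-r)}{r!} \\
&= \frac{r\,(n-1)(n-2)\cdots(n-r+1)}{r \cdot (r-1)!} + \frac{[(n-1)(n-2)\cdots(n-r+1)]\,(n-r)}{r!} \\
&= \frac{(n-1)(n-2)\cdots(n-r+1)\,(r + n - r)}{r!} \\
&= \frac{n(n-1)(n-2)\cdots(n-r+1)}{r!} = \binom{n}{r}. \tag{2.24}
\end{align*}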


An alternative, so-called combinatorial proof of Equation 2.22 is as follows: $\binom{n}{r}$
equals the number of ways of choosing r objects out of n. Let x denote one of the 
n objects. (It does not matter which one.) Then, the selected r objects will either 
contain x or will not. The number of ways of selecting r objects with x is $\binom{n-1}{r-1}$,
since there are $n - 1$ objects other than x, and we must choose $r - 1$ of those in 
addition to x, which we can choose in just one way. On the other hand, the number 
of ways of selecting r objects without x is $\binom{n-1}{r}$, because there are $n - 1$ objects 
other than x and we must choose r of those. Using the addition principle for these 
two ways of choosing r objects out of n completes the proof. □
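
Theorem 2.4.1 is exactly the rule by which Pascal's triangle is built. The sketch below (the function name `pascal_rows` is hypothetical) generates the rows by addition alone and compares them with the binomial coefficients computed by `math.comb`, which assumes Python 3.8 or later.

```python
from math import comb

def pascal_rows(n_max):
    """Return rows n = 0..n_max of Pascal's triangle, built only by addition."""
    rows = [[1]]
    for n in range(1, n_max + 1):
        prev = rows[-1]
        # Interior entries use C(n, r) = C(n-1, r-1) + C(n-1, r); the ends are 1.
        row = [1] + [prev[r - 1] + prev[r] for r in range(1, n)] + [1]
        rows.append(row)
    return rows

rows = pascal_rows(5)
for row in rows:
    print(row)

# Each entry agrees with the binomial coefficient computed directly.
assert all(rows[n][r] == comb(n, r) for n in range(6) for r in range(n + 1))
```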


The next topic we want to discuss is the binomial theorem. 

An expression that consists of two terms is called a binomial, and the binomial 
theorem gives a formula for the powers of such expressions. The binomial coeffi- 
cients are the coefficients in that formula, and this circumstance explains their name. 
Let us first see how they show up in some simple cases. 

We know that 


\[ (a+b)^2 = a^2 + 2ab + b^2 \tag{2.25} \]

and 

\[ (a+b)^3 = a^3 + 3a^2b + 3ab^2 + b^3. \tag{2.26} \]


The coefficients on the right-hand sides are 1, 2, 1 and 1, 3, 3, 1, and these are 
the numbers in the rows for n = 2 and 3 in Pascal’s triangle. In general we have 


Theorem 2.4.2 (The Binomial Theorem). For any natural number⁴ n, and any 
numbers a, b,

\[ (a+b)^n = \binom{n}{0}a^n + \binom{n}{1}a^{n-1}b + \binom{n}{2}a^{n-2}b^2 + \cdots + \binom{n}{n}b^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k. \]


⁴ In fact, the theorem can be extended to arbitrary real exponents as discussed in calculus 
courses, but then the combinatorial meaning shown in the present proof, which is what we 
need, is lost. 


Proof. Let us first illustrate the proof for n = 3. Then 

\[ (a+b)^3 = (a+b)(a+b)(a+b), \tag{2.27} \]

and we can perform the multiplication in one fell swoop instead of obtaining 
$(a+b)^2$ first and then multiplying that by $(a+b)$. When we do both multiplications
simultaneously, we have to multiply each letter in each pair of parentheses 
by each letter in the other pairs of parentheses, and add up all such products of three 
factors. Thus the products we add up are obtained by multiplying one letter from each 
expression in parentheses in every possible way. Since we choose from two letters 
three times, we have $2^3 = 8$ products such as aaa, aab, etc., to add up. Now, some 
of these products are equal to each other, for example, $aab = aba = baa = a^2b$. 
The number of ways in which we can choose the three a’s from the three $(a+b)$’s is 
one. Thus, we have one $a^3$ in the result. The number of $a^2b$ terms is $\binom{3}{1} = 3$, since 
we can choose the one $(a+b)$ from which the factor b comes in $\binom{3}{1}$ ways. Similarly, 
the number of $ab^2$ terms is $\binom{3}{2} = 3$, since we can choose the two $(a+b)$’s from 
which the two b’s come in $\binom{3}{2}$ ways. Finally, we have just one $b^3$ term. Thus, 

\[ (a+b)^3 = a^3 + \binom{3}{1} a^2 b + \binom{3}{2} a b^2 + b^3. \tag{2.28} \]


To make each term conform to the general pattern, we can write the first and last 
terms as $\binom{3}{0}a^3b^0$ and $\binom{3}{3}a^0b^3$, and write $b^1$ for b and $a^1$ for a in the second and third 
terms. Then, for instance, $\binom{3}{0}b^0 = 1$ means that there is only one way to select zero 
b’s, and the product with no b is the same as the one multiplied by $b^0$. 

In the general case of $(a+b)^n$, the result will have all possible kinds of terms, in 
which a total of n a’s and b’s are multiplied together: one letter from each of the n 
factors $(a+b)$. If the number of a’s chosen is k, then the number of b’s must be $n-k$, 
since a total of n letters must be multiplied for each term of the result. Furthermore, 
the coefficient of $a^k b^{n-k}$ must be $\binom{n}{k}$, since we can select the k factors $(a+b)$ from 
which we take the a’s in exactly that many ways. Thus the expansion of $(a+b)^n$ 
must consist of terms of the form $\binom{n}{k} a^k b^{n-k}$, with k taking all possible values from 
0 to n. □
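
The selection argument in this proof can be carried out literally for small n. The following sketch (the function name `expand_by_selection` is hypothetical) picks one letter from each of the n factors in every possible way and counts how many of the resulting products equal each $a^k b^{n-k}$.

```python
from collections import Counter
from itertools import product
from math import comb

def expand_by_selection(n):
    """Count the products obtained by picking one letter from each of n factors (a+b)."""
    counts = Counter()
    for choice in product("ab", repeat=n):
        k = choice.count("a")          # number of a's chosen
        counts[(k, n - k)] += 1        # this product equals a^k b^(n-k)
    return counts

counts = expand_by_selection(3)
print(sorted(counts.items()))
# [((0, 3), 1), ((1, 2), 3), ((2, 1), 3), ((3, 0), 1)]

# The number of products equal to a^k b^(n-k) is the binomial coefficient C(n, k).
assert all(c == comb(3, k) for (k, _), c in counts.items())
```

The eight products of the n = 3 case fall into groups of sizes 1, 3, 3, 1, which are the coefficients in Equation 2.28.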


We can of course use the binomial theorem for the expansion of binomials with 
all kinds of expressions in place of a and b, as in the next example. 


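Example 2.4.1 (A Binomial Expansion). 

\begin{align*}
(3x - 2)^4 &= (3x + (-2))^4 \\
&= (3x)^4 + \binom{4}{1}(3x)^3(-2) + \binom{4}{2}(3x)^2(-2)^2 + \binom{4}{3}(3x)(-2)^3 + (-2)^4 \\
&= 3^4 x^4 - 4 \cdot 3^3 \cdot 2\,x^3 + 6 \cdot 3^2 \cdot 2^2\,x^2 - 4 \cdot 3 \cdot 2^3\,x + 2^4 \\
&= 81x^4 - 216x^3 + 216x^2 - 96x + 16. \tag{2.29}
\end{align*}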


Example 2.4.2 (Counting Subsets). If we put a = b = 1 in the binomial theorem, 
then it gives 

\[ \binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n} = 2^n. \tag{2.30} \]

This can also be seen directly from the combinatorial interpretations of the quantities
involved: If we have a set of n elements, then $\binom{n}{0}$ is the number of its 0-element 
subsets, $\binom{n}{1}$ is the number of its 1-element subsets, and so on; and the sum of these 
is the total number of subsets of the set of n elements, which is $2^n$, as we know from 
Example 2.2.5. 


Example 2.4.3 (Alternating Sum of Binomial Coefficients). Putting a = 1 and b = −1
in the binomial theorem, we obtain 

\[ \binom{n}{0} - \binom{n}{1} + \binom{n}{2} - \cdots + (-1)^n \binom{n}{n} = 0. \tag{2.31} \]

This would be more difficult to interpret combinatorially; we do not do it here 
(but see Exercise 2.4.6). 
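
Both special cases of the binomial theorem, a = b = 1 and a = 1, b = −1, are easy to confirm numerically. A minimal sketch, assuming Python 3.8 or later for `math.comb`; the helper name `row_sums` is hypothetical:

```python
from math import comb

def row_sums(n):
    """Return the plain sum and the alternating sum of row n of Pascal's triangle."""
    row = [comb(n, k) for k in range(n + 1)]
    total = sum(row)
    alternating = sum((-1) ** k * c for k, c in enumerate(row))
    return total, alternating

for n in range(1, 8):
    total, alternating = row_sums(n)
    assert total == 2 ** n      # Equation 2.30
    assert alternating == 0     # Equation 2.31

print(row_sums(5))   # (32, 0)
```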


There is one other property of binomial coefficients that is important for us; we 
approach it by an example. 


Example 2.4.4 (Counting Ways for a Committee). In Exercise 2.3.6 we asked a question
about forming a committee of four people out of 10 men and 12 women. Such a 
committee can have either 0 men and 4 women, or 1 man and 3 women, or 2 men and 
2 women, or 3 men and 1 woman, or 4 men and 0 women. Since these are the disjoint 
possibilities that make up the possible choices for the committee, regardless of sex, 
we can count their number on the one hand by using the addition and multiplication 
principles, and on the other hand, directly, without considering the split by sex. Thus 

\[ \binom{10}{0}\binom{12}{4} + \binom{10}{1}\binom{12}{3} + \binom{10}{2}\binom{12}{2} + \binom{10}{3}\binom{12}{1} + \binom{10}{4}\binom{12}{0} = \binom{22}{4}. \tag{2.32} \]


We can generalize this example as follows: If we have $n_1$ objects of one kind and 
$n_2$ objects of another kind, and take a sample of r objects from these, with $r \le n_1$ 
and $r \le n_2$, then the number of choices can be evaluated in two ways, and we get 

\[ \binom{n_1}{0}\binom{n_2}{r} + \binom{n_1}{1}\binom{n_2}{r-1} + \cdots + \binom{n_1}{r}\binom{n_2}{0} = \binom{n_1+n_2}{r}. \tag{2.33} \]

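
The two ways of counting in Example 2.4.4 and its generalization can be compared directly. In the sketch below the function name `split_count` is hypothetical; it sums over the number k of objects taken from the first kind.

```python
from math import comb

def split_count(n1, n2, r):
    """Count samples of size r by the number k taken from the first kind."""
    return sum(comb(n1, k) * comb(n2, r - k) for k in range(r + 1))

# Committee of 4 from 10 men and 12 women, counted both ways.
print(split_count(10, 12, 4), comb(22, 4))   # 7315 7315
```
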
Exercises 


Exercise 2.4.1. Write down Pascal’s triangle to the row with n = 10. 

Exercise 2.4.2. Use Pascal’s triangle and the binomial theorem to expand (a + b)°. 
Exercise 2.4.3. Expand (1 + x)°. 

Exercise 2.4.4. Expand (2x — 3)°. 

Exercise 2.4.5. What would be the coefficient of x® in the expansion of (1 + x)!°? 


Exercise 2.4.6. Explain the formula $\binom{3}{0} - \binom{3}{1} + \binom{3}{2} - \binom{3}{3} = 0$ by using the expansion 
of $n(A \cup B \cup C)$ from Equation 2.8. 


Exercise 2.4.7. Use the binomial theorem to evaluate 


(a) $\sum_{k=0}^{n} \binom{n}{k} 4^k$, 

(b) $\sum_{k=0}^{n} \binom{n}{k} x^k$ for any $x \neq 0$. 


Exercise 2.4.8. In how many ways can a committee of 4 be formed from 10 men 
(including Bob) and 12 women (including Alice and Claire) if it is to have 2 men 
and 2 women but, 


(a) Alice refuses to serve with Bob, 

(b) Alice refuses to serve with Claire, 

(c) Alice will serve only if Claire does, too, 
(d) Alice will serve only if Bob does, too? 


Exercise 2.4.9. How many subsets does a set of n > 4 elements have that contain 


(a) at least two elements, 
(b) at most four elements? 


Exercise 2.4.10. Generalize Theorem 2.4.1 by considering two special objects x and 
y instead of the single object x in the combinatorial proof. 


2.5 Permutations with Repetitions 


Until now, we have discussed permutations of objects different from each other, ex- 
cept for some special cases to which we will return below. In this section, we consider 
permutations of objects, some of which may be identical or, which amounts to the 
same thing: different objects that may be repeated in the permutations. 

The special cases we have already encountered are the following: First, the number
of possible permutations of length n out of r different objects with an arbitrary 
number of repetitions, that is, with any one of the r things in any one of the n places, 
is $r^n$. (For example, the number of two-letter “words” made up of a, b, or c is $3^2$: 
aa, ab, ac, ba, bb, bc, ca, cb, cc.) 

The second case we have seen in a disguise is that of the permutations of length 
n of two objects, with r of the first object and $n - r$ of the second object chosen. 
The number of such permutations is obviously ${}_nC_r$, since to obtain any one of them 
we may just select the r places out of n for the first object. 

In general, if we have k different objects and we consider permutations of length 
n, with the first object occurring $n_1$ times, the second $n_2$ times, and so on, with the 
kth object occurring $n_k$ times, then we must have $n_1 + n_2 + \cdots + n_k = n$, and the 
number of such permutations is 

\[ \frac{n!}{n_1!\,n_2! \cdots n_k!}. \tag{2.34} \]

This follows at once from our previous counts for permutations and the division 
principle: if all the n objects were different, then the number of their permutations
would be n!. When, however, we identify the $n_1$ objects of the first kind with 
each other, then we are grouping the permutations into sets with $n_1!$ members in 
each; and so we must divide the n! by $n_1!$ to account for the indistinguishability of 
the objects of the first kind. Similarly, we must divide the count by $n_2!$ to reflect the 
indistinguishability of the $n_2$ objects of the second kind, and so on. 

The quantity above is called a multinomial coefficient, and is sometimes denoted 
by the symbol 

\[ \binom{n}{n_1, n_2, \ldots, n_k}. \tag{2.35} \]

Note that for k = 2 the multinomial coefficient equals the corresponding binomial
coefficient, that is, 

\[ \binom{n}{n_1, n_2} = \binom{n}{n_1} = \binom{n}{n_2}. \tag{2.36} \]

The reason for this relation is that when we have $n_1$ objects of one kind and $n_2$ 
objects of another kind, then the number of ways of arranging them in a row is the 
same as the number of ways of selecting the $n_1$ spaces for the first type from the 
total of $n_1 + n_2 = n$ spaces, or the number of ways of selecting the $n_2$ spaces for the 
second type from the same total. 


Example 2.5.1 (Number of Words). How many seven-letter words can be made up of 
two a’s, two b’s, and three c’s? 

Here $n = 7$, $k = 3$, $n_1 = 2$, $n_2 = 2$, and $n_3 = 3$. Thus the answer is 

\[ \binom{7}{2, 2, 3} = \frac{7!}{2!\,2!\,3!} = 210. \tag{2.37} \]
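
Example 2.5.1 is small enough to check by brute force. The sketch below computes the multinomial coefficient from Equation 2.34 (the function name `multinomial` is hypothetical) and compares it with the number of distinct arrangements of the letters aabbccc.

```python
from itertools import permutations
from math import factorial

def multinomial(*counts):
    """n! / (n_1! n_2! ... n_k!) with n = n_1 + ... + n_k."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result

# Brute force: distinct arrangements of the letters aabbccc.
# The 7! = 5040 permutations collapse to 210 once duplicates are identified.
distinct_words = set(permutations("aabbccc"))

print(multinomial(2, 2, 3), len(distinct_words))   # 210 210
```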


The reason for calling the quantities above multinomial coefficients is that they 
occur as coefficients in a formula giving the nth power of expressions of several 
terms, called multinomials: 


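Theorem 2.5.1 (Multinomial Theorem). For any real numbers $x_1, x_2, \ldots, x_k$, and 
any natural number n, 

\[ (x_1 + x_2 + \cdots + x_k)^n = \sum \binom{n}{n_1, n_2, \ldots, n_k} x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k}, \tag{2.38} \]

with the sum taken over all nonnegative integer values $n_1, n_2, \ldots, n_k$ such that 
$n_1 + n_2 + \cdots + n_k = n$. 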


The proof of this theorem is omitted; it would resemble the proof of the binomial 
theorem. 


Example 2.5.2 (A Multinomial Expansion). 

\begin{align*}
(x + y + z)^4 = {} & x^4 + y^4 + z^4 + 4(x^3y + xy^3 + x^3z + xz^3 + y^3z + yz^3) \\
& + 6(x^2y^2 + x^2z^2 + y^2z^2) + 12(x^2yz + xy^2z + xyz^2), \tag{2.39}
\end{align*}

since 

\[ \binom{4}{4, 0, 0} = \frac{4!}{4!\,0!\,0!} = 1, \qquad \binom{4}{3, 1, 0} = \frac{4!}{3!\,1!\,0!} = 4, \]

\[ \binom{4}{2, 2, 0} = \frac{4!}{2!\,2!\,0!} = 6, \qquad \binom{4}{2, 1, 1} = \frac{4!}{2!\,1!\,1!} = 12, \]

and permuting the numbers in the lower row in any multinomial coefficient leaves 
the latter unchanged. 


In closing this section, let us consider a problem that can be reduced to one of 
counting permutations with two kinds of indistinguishable objects: 


Example 2.5.3 (Placing Indistinguishable Balls Into Distinguishable Boxes). In how 
many ways can k indistinguishable⁵ balls be distributed into n different boxes? 

If there are k = 2 balls and n = 3 boxes, then the possible distributions can be 
listed as ordered triples of nonnegative whole numbers that add up to two, and which 
give the numbers of balls in the boxes. They are (2, 0, 0), (0, 2, 0), (0, 0, 2), (1, 1, 0), 
(1, 0, 1), and (0, 1, 1); thus in this case the answer is 6. 

In the general case the problem can be solved by the following trick: 

Each distribution can be represented by a sequence of circles and bars, with the 
circles representing the balls, and the bars the walls of the boxes (we put only one 
bar as a wall between two boxes). For instance, Figure 2.5 shows the distribution 
(0,3,1,2,0, 2) of 8 balls into 6 boxes arranged in a row. 

Now, if there are 6 boxes, then we have 7 bars. Two of those must be fixed at the 
ends, and the remaining 5 can have various positions among the balls. 

In general, if we have n boxes, then we can choose the positions of $n - 1$ bars 
freely. Thus, the problem becomes that of counting the number of permutations of 
$n - 1$ bars and k circles. We know that the number of such permutations is 

\[ \binom{n-1+k}{n-1,\,k} = \binom{n-1+k}{k} = \binom{n-1+k}{n-1}. \]

This expression is the answer to our question. If k = 2 and n = 3, then it becomes 
$\binom{4}{2} = 6$, as we have seen above by a direct enumeration. 


| | ○ ○ ○ | ○ | ○ ○ | | ○ ○ |


Fig. 2.5. 
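
The circles-and-bars count can be compared with a direct enumeration for small cases. In this sketch the function name `distributions` is hypothetical; it lists every n-tuple of nonnegative whole numbers adding up to k, as in the k = 2, n = 3 listing above.

```python
from itertools import product
from math import comb

def distributions(k, n):
    """All ways to put k indistinguishable balls into n labelled boxes."""
    return [d for d in product(range(k + 1), repeat=n) if sum(d) == k]

# k = 2 balls, n = 3 boxes: the six triples listed in the example.
print(distributions(2, 3))

# k = 8 balls, n = 6 boxes, the case of Fig. 2.5.
print(len(distributions(8, 6)), comb(6 - 1 + 8, 8))   # 1287 1287
```

The brute-force listing grows quickly with k and n, which is why the closed formula is the practical tool.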


⁵ Actually, the balls may be distinguishable, but we may not want to distinguish them. In 
some applications, for instance involving distribution of money to people, all we care about 
is how many dollars someone gets, not which dollar bills. 


Exercises 


Exercise 2.5.1. In how many ways can we form six-letter words 


(a) from a’s and/or b’s, 

(b) from two a’s and four b’s, 

(c) from two a’s, one b and three c’s, 

(d) from two a’s and four letters each of which may be b or c? 


Exercise 2.5.2. On how many paths can a rook move from the lower left corner of 
a chessboard to the diagonally opposite corner by moving only up or to the right at 
each step? 


Exercise 2.5.3. (a) How many permutations are there of the letters of the word “success”? 

(b) How many of the above have exactly three s’s together (Hint: Consider sss as if 
it were a single letter.) 

(c) How many have two or three s’s together? (Hint: Regard ss as a single letter.) 

(d) How many have exactly two s’s together? 


Exercise 2.5.4. Prove, both algebraically and combinatorially (that is, in terms of 
selections), that if $n_1 + n_2 + n_3 = n$, then 

\[ \binom{n}{n_1, n_2, n_3} = \binom{n}{n_1}\binom{n - n_1}{n_2}. \]


Exercise 2.5.5. What is the coefficient of 


(a) a*b>c? in the expansion of (a+b+c+ d)', 
(b) abc? in the expansion of (2a — 3b +c — d)’? 


Exercise 2.5.6. Expand $(2 + 3 + 1)^4$ by the multinomial theorem, and show that the 
terms add up to $6^4 = 1{,}296$. 


Exercise 2.5.7. (a) In how many ways can 10 cents be distributed among 3 children? 
(All that matters is how much each child gets, not which coins, that is, cents are 
considered indistinguishable.) 

(b) In how many ways if each child is to get at least one cent? (Hint: From the 
spaces between circles choose some for bars, or first give 1 cent to each and then 
distribute the remaining 7 cents.) 


Exercise 2.5.8. In how many ways can k indistinguishable balls be distributed into 
n <k different boxes if each box is to get at least one ball? (Hint: From the spaces 
between circles choose some for bars.) 


Exercise 2.5.9. In how many ways can k indistinguishable balls be distributed into 
n > k different boxes if no box is to get more than one ball? 


Exercise 2.5.10. How many distinct terms are there in the multinomial expansions 
of 


(a) a@+b+c)°®, 
(b) (a+b+c+d)°? 


Explain! (Hint: Use Example 2.5.3.) 


Probabilities 


3.1 Relative Frequency and the Axioms of Probabilities 


We begin our discussion of probabilities with the definition of relative frequency, 
since this notion is very concrete and probabilities are, in a sense, idealizations of 
relative frequencies. 


Definition 3.1.1 (Relative Frequency). If we perform an experiment n times (each 
performance is called a trial) and the event A occurs in $n_A$ trials, then the ratio $n_A/n$ 
is called the relative frequency of A in the n trials, and will be denoted by $f_A$. 


For example, if we toss a coin n = 100 times, and observe heads $n_H = 46$ times, 
then the relative frequency of heads in those trials is $f_H = n_H/n = 46/100 = 0.46$. 

For two mutually exclusive events A and B, the relative frequency of $A \cup B$ 
in n trials turns out to be the sum of the relative frequencies of A and B, because 
$n_{A \cup B} = n_A + n_B$ by the addition principle, and so $f_{A \cup B} = f_A + f_B$. 
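
The additivity of relative frequencies can be observed in a simulated experiment. The sketch below throws a die 600 times with Python's pseudo-random generator; the seed, the number of trials and the events A and B are arbitrary choices made for illustration, and `relative_frequency` is a hypothetical helper name.

```python
import random
from fractions import Fraction

def relative_frequency(outcomes, event):
    """Fraction of trials whose outcome belongs to the event (a set)."""
    return Fraction(sum(1 for x in outcomes if x in event), len(outcomes))

rng = random.Random(0)                      # fixed seed, chosen arbitrarily
throws = [rng.randint(1, 6) for _ in range(600)]

A = {1, 2}        # mutually exclusive events for one throw of a die
B = {6}

f_A = relative_frequency(throws, A)
f_B = relative_frequency(throws, B)
f_union = relative_frequency(throws, A | B)
print(f_A, f_B, f_union)

# Exact equality holds because A and B never occur in the same trial.
assert f_union == f_A + f_B
```

Exact fractions are used so that the equality is not blurred by floating-point rounding.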

As mentioned in the Introduction, we assign probabilities to events in such a way 
that the relative frequency of an event in a large number of trials should approximate 
the probability of that event. We can expect this to happen only if we define proba- 
bilities so that they have the same basic properties as relative frequencies. Thus we 
state the following definition. 


Definition 3.1.2 (Probabilities). Given a sample space S and a certain collection F 
of its subsets, called events,¹ an assignment P of a number P(A) to each event A in F 
is called a probability measure, and P(A) the probability of A, if P has the following 
properties: 


1. $P(A) \ge 0$ for every A, 

2. $P(S) = 1$, and 

3. $P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$ for any finite or countably infinite set 
of mutually exclusive events $A_1, A_2, \ldots$. 


¹ If S is a finite set, then the collection F of events is taken to be the collection of all subsets 
of S. If S is infinite, then F must be a so-called sigma-field, which we do not discuss here. 


The sample space S together with F and P is called a probability space. 


The properties of P in the definition are also called the axioms of the theory. 
Furthermore, if Axiom 3 were stated for only two sets, then from that form it could 
be proved for an arbitrary finite number of sets (finite additivity) by mathematical 
induction, but not for an infinite number (countable additivity), which we also need. 

From this definition, several other important properties of probabilities follow 
rather easily, which we give as theorems. In each of these theorems an underlying 
arbitrary probability space will be tacitly understood. 


Theorem 3.1.1 (The Probability of the Empty Set Is 0). In any probability space, 
$P(\emptyset) = 0$. 


Proof. Consider an event A. Then $A \cup \emptyset = A$, and A and $\emptyset$ are mutually exclusive, 
since $A \cap \emptyset = \emptyset$. Hence $P(A \cup \emptyset) = P(A)$ on the one hand, and on the other, 
by Property 3 applied to $A_1 = A$ and $A_2 = \emptyset$, $P(A \cup \emptyset) = P(A) + P(\emptyset)$. Thus 
$P(A) = P(A) + P(\emptyset)$, and so $P(\emptyset) = 0$. □

Note, however, that the empty set need not be the only set with zero probability, 
that is, in some probability spaces we have events $A \neq \emptyset$ for which P(A) = 0. There 
is nothing in the axioms that would prevent such an occurrence. In fact, such events 
need not be impossible. For instance, if the experiment consists of picking a point 
at random from the interval [0,1] of real numbers, then each number must have 
zero probability, otherwise Axiom 3 would imply that the sum of the probabilities 
of an infinite sequence of such numbers is infinite, in contradiction to Axiom 2. 
(Why?) To make useful probability statements in this case, we assign probabilities 
to subintervals of nonzero length of [0, 1], rather than to single numbers. Details will 
be discussed in Chapter 4 and thereafter. 


Theorem 3.1.2 (The Probability of the Union of Two Events). For any two events 
A and B, 

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B). \tag{3.1} \]


We leave the proof as an exercise; it follows the proof of the analogous property 
of n(A U B) (Theorem 2.1.2). 


Theorem 3.1.3 (The Probability of the Union of Three Events). For any three 
events, 


\[ P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(AB) - P(AC) - P(BC) + P(ABC). \]

Proof. We apply Theorem 3.1.2 three times: 

\begin{align*}
P(A \cup B \cup C) &= P(A \cup (B \cup C)) = P(A) + P(B \cup C) - P(A(B \cup C)) \\
&= P(A) + P(B) + P(C) - P(BC) - P(AB \cup AC) \\
&= P(A) + P(B) + P(C) - P(BC) - [P(AB) + P(AC) - P(ABAC)] \\
&= P(A) + P(B) + P(C) - P(AB) - P(AC) - P(BC) + P(ABC).
\end{align*}
□


Theorem 3.1.4 (Probability of Complements). For any event A, 

\[ P(\bar{A}) = 1 - P(A). \]


Proof. $A \cap \bar{A} = \emptyset$ and $A \cup \bar{A} = S$ by the definition of $\bar{A}$. Thus, by Axiom 3, 
$P(S) = P(A \cup \bar{A}) = P(A) + P(\bar{A})$. Now, Axiom 2 says that P(S) = 1, and so, comparing 
these two values of P(S), we obtain $P(A) + P(\bar{A}) = 1$. □


Theorem 3.1.5 (Probability of Subsets). If $A \subset B$, then $P(A) \le P(B)$. 


Proof. If $A \subset B$, then $B = A \cup (B \cap \bar{A})$, with A and $B \cap \bar{A}$ being disjoint. Thus, by 
Axiom 3, $P(B) = P(A) + P(B \cap \bar{A})$, and by Axiom 1, $P(B \cap \bar{A}) \ge 0$. Therefore P(B) 
is P(A) plus a nonnegative quantity, and so is greater than or equal to P(A). □


Corollary 3.1.1. $P(A) \le 1$ for all events A. 


Proof. In Theorem 3.1.5 take B = S. Since $A \subset S$ for every event A and P(S) = 1 
by Axiom 2, Theorem 3.1.5 gives $P(A) \le 1$. □


Example 3.1.1 (Drawing a Card). For drawing a card at random from a deck of 52 
cards, we consider the sample space S made up of the 52 elementary events corre- 
sponding to the 52 possible choices of drawing any one of the cards. We assign 1/52 
as the probability of each of the elementary events, and for any compound event A 
we define its probability P(A) as the number n(A) of the elementary events that make 
up A times 1/52, that is, as 

\[ P(A) = n(A) \cdot \frac{1}{52}. \tag{3.2} \]

For example, the probability of drawing a spade is $13 \cdot (1/52) = 1/4$, since there 
are 13 spades and the drawing of each spade is an elementary event, the 13 of which 
make up the event A = {a spade is drawn}. 

It is easy to verify our axioms for this case: 

1. Obviously, the assignment, Equation 3.2, makes every P(A) nonnegative. 

2. P(S) = 1, since S is made up of all the 52 elementary events, and so 
$P(S) = 52 \cdot (1/52) = 1$. 

3. By Theorem 2.1.1, for k pairwise disjoint sets $A_1, A_2, \ldots, A_k$, Equation 2.2 
gives 

\[ \frac{n(A_1 \cup A_2 \cup \cdots \cup A_k)}{52} = \frac{n(A_1)}{52} + \frac{n(A_2)}{52} + \cdots + \frac{n(A_k)}{52}, \tag{3.3} \]

and, by Equation 3.2, Equation 3.3 becomes 

\[ P(A_1 \cup A_2 \cup \cdots \cup A_k) = P(A_1) + P(A_2) + \cdots + P(A_k). \tag{3.4} \]


As we did in the special case of the above example, we can prove 


Theorem 3.1.6 (Assignment of Probabilities in a Finite Sample Space). In a finite 
sample space we obtain a probability measure by assigning nonnegative numbers 
whose sum is 1 as probabilities of the elementary events and, for general A, by 
taking the sum of the probabilities of the elementary events that make up A as P(A). 


Example 3.1.2 (An Assignment of Unequal Probabilities). Let $S = \{s_1, s_2, s_3, s_4\}$ 
and assign probabilities to the elementary events² as $P(s_1) = 1/2$, $P(s_2) = 1/3$, 
$P(s_3) = 1/6$, $P(s_4) = 0$, and, for general A, take as P(A) the sum of the probabilities 
of the elementary events that make up A. For instance, if $A = \{s_1, s_2\}$, then take 
P(A) = 1/2 + 1/3 = 5/6. We could easily verify the axioms for this assignment. 

How could we realize an experiment that corresponds to this probability space? 
One way of doing this would be to consider picking a number at random from the 
interval [0, 1] of real numbers (as random number generators do on computers, more 
or less) and letting $s_1 = [0, 1/2)$, $s_2 = [1/2, 5/6)$, $s_3 = [5/6, 1)$, $s_4 = \{1\}$. 
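
The verification of the axioms for this assignment can be done mechanically, since a four-element sample space has only 16 events. The sketch below uses exact fractions; the names `p`, `P` and `all_events` are illustrative, and Axiom 3 is checked only for pairs of disjoint events, which suffices in a finite space by induction.

```python
from fractions import Fraction
from itertools import chain, combinations

# Probabilities of the elementary events of Example 3.1.2.
p = {"s1": Fraction(1, 2), "s2": Fraction(1, 3), "s3": Fraction(1, 6), "s4": Fraction(0)}

def P(event):
    """Probability of an event, given as a collection of elementary events."""
    return sum((p[s] for s in event), Fraction(0))

def all_events(space):
    """Every subset of the sample space."""
    items = list(space)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

events = [frozenset(e) for e in all_events(p)]

assert all(P(e) >= 0 for e in events)                      # Axiom 1
assert P(p.keys()) == 1                                    # Axiom 2
assert all(P(a | b) == P(a) + P(b)                         # Axiom 3 (two disjoint events)
           for a in events for b in events if not (a & b))
print(P({"s1", "s2"}))                                     # 5/6
```

Note that the event {s4} has probability 0 without being empty, in line with the remark after Theorem 3.1.1.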


Theorem 3.1.6 has a very important special case, which we state as a corollary: 


Corollary 3.1.2. If a sample space consists of n elementary events of equal prob- 
ability, then this common probability is 1/n and, if an event A is the union of k 
elementary events, then P(A) = k/n. 


It is customary to call the k outcomes that make up A the outcomes favorable 
to A, and to call n the total number of possible outcomes. Thus, for equiprobable 
elementary events, the assignment can be summarized as 


\[ P(A) = \frac{\text{favorable}}{\text{total}}. \]


For a long time this formula was considered to be the definition of P(A), and is 
still called the classical definition of probabilities. Example 3.1.1 provided an illus- 
tration of this: The probability of drawing a spade from a deck of 52 cards, if one 
card is drawn at random (i.e., with equal probability for each card), is 13/52, since 
k = 13 and n = 52. 

Note, however, that the probability of drawing a spade is not 13/52 under all con- 
ditions. Corollary 3.1.2 ensures this value only if all cards have the same probability 
of being drawn, which will not be true if the deck is not well shuffled or if we use 
some special method of drawing. In fact, there is no way of proving that all cards 
must have the same probability of being drawn, no matter how we do the shuffling 
and drawing. The equal probabilities in this case are assignments based on our expe- 
rience. In every case, some probabilities must somehow be assigned, and the theory 
is only intended to show how to calculate certain probabilities and related quantities 
from others (also see the Introduction). For instance, Theorem 3.1.6 and its corollary 


² It is customary to omit the braces in writing the probabilities of the elementary events, such 
as writing $P(s_1)$ instead of the correct, but clumsy, $P(\{s_1\})$. 

tell us how to calculate the probabilities of compound events from those of the elementary
events. Thus, the so-called classical definition of probabilities is not really a 
definition by present-day standards, but a very useful formula for the calculation of 
probabilities in many cases. 


Exercises 


Exercise 3.1.1. We draw a card at random from a deck of 52. Let A = {the card 
drawn is a spade}, B = {the card drawn is a face card}, C = {the card drawn is a 
King}. Find: 


(a) P(A), 
(b) P(B), 

(c) P(C), 

(d) $P(A \cap B)$, 
(e) $P(A \cup B)$, 
() BNC), 
(2) PBNC), 
(h) $P(B \cup C)$. 


Exercise 3.1.2. We throw two dice as in Example 1.3.3, a black one and a white one. 
If b denotes the result of the throw of the black die, and w that of the white die, then 
let A= {b+ w = 7}, B = {b < 3}, C = {w > 4}. Find the eight probabilities listed 
in Exercise 3.1.1 above, but using this assignment of A, B and C. 


Exercise 3.1.3. Prove Theorem 3.1.2 using the axioms. 


Exercise 3.1.4. Prove that, for any two events A and B, $P(AB) \ge P(A) + P(B) - 1$. 


Exercise 3.1.5. When is $P(A - B) = P(A) - P(B)$? Prove your answer, paying attention
to events with zero probability (see also Example 3.1.2) other than $\emptyset$. 


Exercise 3.1.6. For any two events A and B, the expression $A\bar{B} \cup \bar{A}B$ is called 
their symmetric difference and corresponds to the “exclusive or” of the corresponding
statements, that is, to “one or the other but not both.” Find an expression for 
$P(A\bar{B} \cup \bar{A}B)$ in terms of P(A), P(B) and P(AB) and prove it. 


Exercise 3.1.7. Prove, for arbitrary events and any integer n > 1, 

(a) $P(A \cup B) \le P(A) + P(B)$, 

(b) $P(A \cup B \cup C) \le P(A) + P(B) + P(C)$, 

(c) $P\left(\bigcup_{i=1}^{n} A_i\right) \le \sum_{i=1}^{n} P(A_i)$. 

Exercise 3.1.8. Consider the sample space S = {a, b, c, d} and assign probabilities 
to the elementary events as P({a}) = 1/7, P({b}) = 2/7, P({c}) = 4/7, P({d}) = 0. 


(a) Compute the probabilities of all compound events, as described in Theorem 
3.1.6. 
(b) Find two sets A and B such that $AB \neq B$, but P(AB) = P(B). 


3.2 Probability Assignments by Combinatorial Methods 


In this section, we consider several examples of probability assignments to complex 
events, under the assumption that the elementary events are equiprobable. Thus, we 
use the classical definition and, because of the complexity of the problems, the com- 
binatorial methods developed in Chapter 2. 


Example 3.2.1 (Probability of Drawing two Given Cards). We draw two cards from
a deck of 52 without replacement. What is the probability of drawing a King and an
Ace without regard to order?

We solve this problem in two ways.

First, the total number of ways of drawing two cards with regard to order is 52 · 51
and there are $4^2$ ways of drawing a King first and an Ace second, and another $4^2$
ways of drawing an Ace first and a King second. Thus

$$P(\text{K and A}) = \frac{2 \cdot 4^2}{52 \cdot 51}. \tag{3.5}$$

The other way to solve this problem is to start by disregarding the order. Then
the total number of possible outcomes is $\binom{52}{2}$, which are again equally likely,
and the number of ways of choosing one King out of four is $\binom{4}{1}$ and of one Ace
out of four also $\binom{4}{1}$. Thus,

$$P(\text{K and A}) = \frac{\binom{4}{1}\binom{4}{1}}{\binom{52}{2}} = \frac{4 \cdot 4}{(52 \cdot 51)/(2 \cdot 1)} = \frac{2 \cdot 4^2}{52 \cdot 51}, \tag{3.6}$$

the same as before.
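
Both counts can be checked by brute-force enumeration. The following is a minimal
sketch, not part of the original text; the deck encoding (the names RANKS, SUITS and
DECK) is an assumption made only for this illustration.

```python
from fractions import Fraction
from itertools import combinations, permutations

# Hypothetical deck encoding: 13 ranks in each of 4 suits.
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["S", "H", "D", "C"]
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]


def is_king_and_ace(cards):
    """True if the two cards are one King and one Ace, in either order."""
    return sorted(rank for rank, _ in cards) == ["A", "K"]


# Ordered draws: 52 * 51 equally likely pairs (first method).
ordered = list(permutations(DECK, 2))
p_ordered = Fraction(sum(is_king_and_ace(d) for d in ordered), len(ordered))

# Unordered draws: C(52, 2) equally likely pairs (second method).
unordered = list(combinations(DECK, 2))
p_unordered = Fraction(sum(is_king_and_ace(d) for d in unordered), len(unordered))

print(p_ordered, p_unordered)
```

Exact fractions are used so that the two methods can be compared for equality rather
than up to rounding.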


Example 3.2.2 (Probability of Head and Tail). We toss two coins. What is the
probability of obtaining one Head and one Tail?

If we denote the outcome of the toss of the first coin by $H_1$ and $T_1$, and of the
second by $H_2$ and $T_2$, then the possible outcomes of the toss of both are the sets
$\{H_1, H_2\}$, $\{H_1, T_2\}$, $\{T_1, H_2\}$, $\{T_1, T_2\}$. These outcomes are equally
probable, and the second and third ones are the favorable ones. Thus
P(one H and one T) = 2/4 = 1/2.

For two successive tosses of a single coin instead of simultaneous tosses of two
coins, the possible outcomes could be listed exactly the same way, with $H_1$ and $T_1$
denoting the result of the first toss, and $H_2$ and $T_2$ that of the second toss, or more
simply as HH, HT, TH, and TT, and so the probabilities remain the same.


Notice that we cannot solve this problem by the alternate method of combining
the ordered pairs into unordered ones, as in Example 3.2.1, since {HH}, {HT, TH},
and {TT} are not equally likely. Their probabilities are 1/4, 1/2 and 1/4, respectively.
By ignoring the inequality of the probabilities of the elementary events we would get
P(one H and one T) = 1/3, which is incorrect.


Example 3.2.3 (Six Throws of a Die). A die is thrown six times. What is the
probability of obtaining at least one six? (In such problems, with coins, cards and dice,
it is always assumed that all elementary outcomes are equally likely.)

It is easiest to calculate this probability by using Theorem 3.1.4, that is, from
P(at least one six) = 1 − P(no six). Now the total number of possible (ordered)
outcomes is $6^6$, and since on each throw there are 5 ways of obtaining something
other than six, in six throws we can get numbers other than six in $5^6$ ways. Thus,
P(at least one six) = $1 - 5^6/6^6 \approx 0.665$.


Notice that in this problem, as in the previous one, we must use ordered outcomes,
because the unordered ones would not be equally likely, which is a prerequisite
for computing probabilities by the classical definition.
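
The complement computation can be confirmed by listing all ordered outcomes. This is
only an illustrative sketch (the variable names are not from the text); it counts the
outcomes containing a six directly and compares the result with the complement formula.

```python
from fractions import Fraction
from itertools import product

# All 6**6 = 46656 ordered outcomes of six throws, each equally likely.
outcomes = list(product(range(1, 7), repeat=6))
with_six = sum(1 for throws in outcomes if 6 in throws)

p_direct = Fraction(with_six, len(outcomes))   # counted directly
p_complement = 1 - Fraction(5, 6) ** 6         # 1 - P(no six)

print(float(p_direct))
```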


Example 3.2.4 (Sampling Good and Bad Items without Replacement). In a batch of
N manufactured items there are $N_1$ good ones and $N_2$ defective ones, with
$N_1 + N_2 = N$. We choose a random sample of n items without replacement, that is,
once an item is chosen, we take it out of the pool from which the next items are picked.
Here n is called the size of the sample. We ask: What is the probability of the sample
having $n_1$ good items and $n_2$ bad ones, where $n_1 + n_2 = n$?

We solve this problem with unordered selections. (It could be done with ordered
selections as well, see Exercise 3.2.18.) The total number of equally probable ways
of choosing n items out of N different ones is $\binom{N}{n}$, the number of ways of
choosing $n_1$ good ones out of $N_1$ is $\binom{N_1}{n_1}$, and that of $n_2$ defectives
out of $N_2$ is $\binom{N_2}{n_2}$. Thus, the required probability is given by

$$p(n_1; n, N_1, N_2) = \frac{\binom{N_1}{n_1}\binom{N_2}{n_2}}{\binom{N}{n}}. \tag{3.7}$$

We have used the notation $p(n_1; n, N_1, N_2)$ for this probability, since it is the
probability of the sample containing $n_1$ good items under the given experimental data
of sample size n, and $N_1$ and $N_2$ good and bad items in the total population ($n_2$ is
given by $n - n_1$). The variable $n_1$ can take on any nonnegative integer value satisfying
$n_1 \le n$, $n_1 \le N_1$ and $n - n_1 \le N_2$, that is,
$\max(0, n - N_2) \le n_1 \le \min(n, N_1)$. (See Exercise 3.2.19.) Since the events
described by the different values of $n_1$ are mutually exclusive and their union is the
sure event, the above probabilities sum to 1 as $n_1$ varies from $\max(0, n - N_2)$ to
$\min(n, N_1)$, for any fixed values of n, $N_1$ and $N_2$. (Also see Equation 2.33.) Thus,
the above formula describes how the total probability 1 is distributed over the events
corresponding to the various values of $n_1$.

Whenever we give the probabilities of disjoint events whose union is the whole
sample space, we call such an assignment of probabilities a probability distribution.
The distribution just given is called the hypergeometric distribution with parameters
n, $N_1$ and $N_2$.
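
As a numerical illustration of Equation 3.7, the sketch below evaluates the
hypergeometric probabilities over the admissible range of $n_1$ and confirms that they
sum to 1. The function name hypergeom_pmf and the batch sizes (7 good, 3 bad, sample of
5) are assumptions chosen only for this example.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(n1, n, N1, N2):
    """Probability of n1 good items in a sample of n drawn without
    replacement from N1 good and N2 bad items (Equation 3.7)."""
    n2 = n - n1
    if n1 < 0 or n2 < 0 or n1 > N1 or n2 > N2:
        return Fraction(0)
    return Fraction(comb(N1, n1) * comb(N2, n2), comb(N1 + N2, n))


# Illustrative (assumed) numbers: 7 good, 3 bad, sample of size 5.
n, N1, N2 = 5, 7, 3
support = range(max(0, n - N2), min(n, N1) + 1)
pmf = {n1: hypergeom_pmf(n1, n, N1, N2) for n1 in support}
total = sum(pmf.values())

print(list(support), total)
```

Here the sample must contain at least two good items, because only three bad ones
exist, which is why the range starts at max(0, n − N2) = 2.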


Example 3.2.5 (Sampling with Replacement). Let us modify the previous problem
by asking what the probability of obtaining $n_1$ good items is if we choose a random
sample of n items with replacement from $N_1$ good items and $N_2$ bad ones, that is,
we choose one item at a time, note whether it is good or bad, and replace it in the
population before choosing the next one.

Since in each of the n steps of the sampling we have $N = N_1 + N_2$ items to choose
from, the total number of equally probable elementary events is $N^n$. Next, we must
count how many of these are favorable, that is, how many elementary events have
$n_1$ good items and $n_2$ bad ones. Now, at each of the n steps of the sampling, we can
choose either a good or a bad item, but in $n_1$ of them, we must choose a good one.
We can choose these $n_1$ steps in $\binom{n}{n_1}$ ways. Then at each of these $n_1$ steps
we have a choice of $N_1$ items, and at each of the remaining $n_2 = n - n_1$ steps a choice
of $N_2$ items, for a total of $N_1^{n_1} N_2^{n_2}$ choices. Thus the required probability is

$$f(n_1; n, N_1, N_2) = \frac{\binom{n}{n_1} N_1^{n_1} N_2^{n_2}}{N^n}. \tag{3.8}$$

If we write $N^n = N^{n_1+n_2} = N^{n_1} N^{n_2}$, and replace $n_2$ by $n - n_1$, then we can
write the above formula as

$$f(n_1; n, N_1, N_2) = \binom{n}{n_1} \left(\frac{N_1}{N}\right)^{n_1} \left(\frac{N_2}{N}\right)^{n-n_1}. \tag{3.9}$$

Here $N_1/N$ is the probability of choosing a good item at any given step, and
$N_2/N$ is that of choosing a bad item. It is customary to denote these probabilities by
p and q (with p + q = 1), and then the required probability of obtaining $n_1$ good
items can be written as

$$f(n_1; n, p) = \binom{n}{n_1} p^{n_1} q^{n-n_1}. \tag{3.10}$$

Since these probabilities are the terms of the expansion of $(p + q)^n$ by the binomial
theorem, and they are the probabilities of disjoint events (the different values
of $n_1$) whose union is the sure event, they are said to describe the so-called binomial
distribution with parameters n and p.

It is easy to check that, indeed,

$$\sum_{n_1=0}^{n} \binom{n}{n_1} p^{n_1} q^{n-n_1} = (p + q)^n = 1. \tag{3.11}$$
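
The counting argument can be compared with Equation 3.10 on a small case. In this
sketch the population (3 good and 2 bad items, labelled 0 to 4) and the sample size 4
are assumed values picked for illustration, and binomial_pmf is a hypothetical helper
name.

```python
from fractions import Fraction
from itertools import product
from math import comb


def binomial_pmf(n1, n, p):
    """Equation 3.10 with q = 1 - p."""
    return comb(n, n1) * p ** n1 * (1 - p) ** (n - n1)


# Illustrative (assumed) population: items 0..2 are good, items 3..4 are bad.
N1, N2, n = 3, 2, 4
N = N1 + N2
p = Fraction(N1, N)

# Brute force: all N**n ordered samples with replacement are equally likely.
counts = [0] * (n + 1)
for sample in product(range(N), repeat=n):
    good = sum(1 for item in sample if item < N1)
    counts[good] += 1
brute_force = [Fraction(c, N ** n) for c in counts]

formula = [binomial_pmf(n1, n, p) for n1 in range(n + 1)]
print(brute_force == formula, sum(formula))
```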


Example 3.2.6 (The Birthday Problem). What is the probability that at least two
people, out of a given set of n persons, for 2 ≤ n ≤ 365, have the same birthday?
Disregard February 29, and assume that the $365^n$ possible birthday combinations are
equally likely.

If all the n persons had different birthdays, then there would be 365 choices for
the birthday of the first person, 364 for that of the second, and so on. Thus,

$$P(\text{at least two have the same birthday}) = 1 - \frac{{}_{365}P_n}{365^n}.$$

It is interesting and very surprising that this probability is about 0.5 for as few as
23 people, and about 0.99 for 60 people.
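
The two quoted values are easy to reproduce. The following sketch evaluates the formula
with Python's math.perm; the function name p_shared_birthday and the sample values of n
are chosen here for illustration.

```python
from math import perm


def p_shared_birthday(n, days=365):
    """1 - (days P n) / days**n, for 2 <= n <= days."""
    return 1 - perm(days, n) / days ** n


for n in (10, 23, 60):
    print(n, round(p_shared_birthday(n), 4))
```

With 22 people the probability is still below 1/2, so 23 is the smallest group for
which a shared birthday is more likely than not.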


Example 3.2.7 (Seating Men and Women). m men and n women are seated at random
in a row on m + n chairs, with m < n. What is the probability that no men sit next to
each other?

The total number of possible arrangements is (m + n)! and the number of favorable
arrangements can be obtained as follows.

Consider any arrangement of the n women in a row. Then there are n + 1 spaces
between or around them, from which we must choose m for the men. Thus, we have
$\binom{n+1}{m}$ choices for the seats of the men once the women’s order is set. For any
of the just counted choices, the men can be ordered in m! ways and the women in n! ways,
and so the number of favorable arrangements is $\binom{n+1}{m} m!\, n!$. Hence

$$P(\text{no men sit next to each other}) = \frac{\binom{n+1}{m} m!\, n!}{(m+n)!}. \tag{3.12}$$
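
For small m and n, Equation 3.12 can be checked against a complete enumeration of
seatings. This is a sketch under the assumption that the people are distinguishable and
labelled as ("M", i) or ("W", i); the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import comb, factorial


def p_formula(m, n):
    """Equation 3.12: C(n+1, m) * m! * n! / (m+n)!."""
    return Fraction(comb(n + 1, m) * factorial(m) * factorial(n), factorial(m + n))


def p_brute_force(m, n):
    """Enumerate all (m+n)! seatings of m distinct men and n distinct women."""
    people = [("M", i) for i in range(m)] + [("W", i) for i in range(n)]
    favorable = total = 0
    for seating in permutations(people):
        total += 1
        # Favorable if no two neighbours are both men.
        if all(not (a[0] == "M" and b[0] == "M") for a, b in zip(seating, seating[1:])):
            favorable += 1
    return Fraction(favorable, total)


print(p_formula(2, 3), p_brute_force(2, 3))
```

The enumeration grows like (m + n)!, so it is only practical for a handful of people;
the formula has no such limit.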


Example 3.2.8 (Four of a Kind in Poker). In a variant of the game of poker, play- 
ers bet on the value of a five-card hand dealt to them from a standard 52-card deck. 
The value of the hand is determined by the type of combination of cards. In playing 
the game, it is helpful to know the probabilities of various combinations. In “four 
of a kind,” the player’s hand consists of all four cards of a certain kind, say all four 
Aces, plus one other card. The probability of being dealt four of a kind can be com- 
puted with both ordered and unordered selection, because the unordered selections 
are equiprobable, each consisting of 5! ordered selections. 

With ordered selection, the total number of possible hands is ${}_{52}P_5$ and the number
of favorable hands is 13 · 48 · 5!, since the four like cards can be chosen 13 ways, the
odd card can be any one of the remaining 48 cards, and any one of the 5! orders of
dealing the same cards results in the same hand. Thus,

$$P(\text{four of a kind}) = \frac{13 \cdot 48 \cdot 5!}{{}_{52}P_5} \approx 0.00024. \tag{3.13}$$

With unordered selection, the total number of possible hands is $\binom{52}{5}$ (these
hands are now equally likely), and the number of favorable hands is 13 · 48, since the
four like cards can be chosen 13 ways and the odd card can be any one of the remaining
48 cards. (Now we do not multiply by 5! because the order does not matter.) Thus

$$P(\text{four of a kind}) = \frac{13 \cdot 48}{\binom{52}{5}}. \tag{3.14}$$


Example 3.2.9 (Two Pairs in Poker Dice). The game of poker dice is similar to poker,
but uses dice instead of cards. We want to find the probability of obtaining two pairs
with five dice, that is, a combination of the type x, x, y, y, z in any order, with x, y, z
being distinct numbers from 1 to 6.

Now the total number of possible outcomes is $6^5$. (For this problem, we must
use ordered outcomes, because the unordered ones would not be equally likely.) For
the favorable cases, the numbers x and y can be chosen $\binom{6}{2} = 15$ ways, and the
number z four ways. Furthermore, the number of ways x, x, y, y, z can be ordered is
$\binom{5}{2,\,2,\,1} = 30$. Thus,

$$P(\text{two pairs}) = \frac{15 \cdot 4 \cdot 30}{6^5} \approx 0.23. \tag{3.15}$$
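
The count 15 · 4 · 30 = 1800 for two pairs in poker dice can be verified by
enumerating all ordered rolls. A minimal sketch, in which the helper is_two_pairs is a
name introduced here for illustration:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 6**5 = 7776 ordered outcomes of five dice, each equally likely.
outcomes = list(product(range(1, 7), repeat=5))


def is_two_pairs(roll):
    """True for the pattern x, x, y, y, z with x, y, z distinct."""
    return sorted(Counter(roll).values()) == [1, 2, 2]


favorable = sum(1 for roll in outcomes if is_two_pairs(roll))
p_two_pairs = Fraction(favorable, len(outcomes))

print(favorable, float(p_two_pairs))
```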


Exercises 


Exercise 3.2.1. From a deck of cards use only AS, AH, KS, KH and choose two of
these cards without replacement.


(a) List all possible ordered pair outcomes. 

(b) Using the above, find the probability of obtaining an Ace and a King in either 
order. 

(c) Find the same probability by using unordered pairs. 

(d) Explain why the unordered pairs have equal probabilities unlike those in Example 
3.2.2. 


Exercise 3.2.2. If in Exercise 3.2.1 the drawing is done with replacement, find the 
probability of obtaining an Ace and a King. Can you find this probability by counting 
unordered pairs? Explain. 


Exercise 3.2.3. Explain why in Example 3.2.3 we did not get P(at least one six) = 1, 
in spite of the fact that on each throw the probability of getting a six is 1/6, and 6 
times 1/6 is 1. 


Exercise 3.2.4. What is the probability that a 13-card hand dealt from a deck of 52 
cards will contain 


(a) the Queen of spades, 
(b) five spades and 8 cards from other suits, 
(c) five spades, five hearts, two diamonds and the Ace of clubs? 


Exercise 3.2.5. Three dice are rolled. What is the probability that they show different 
numbers? 


Exercise 3.2.6. m men and n women are seated at random in a row on m +n chairs. 
What is the probability that all the men sit next to each other? 


Exercise 3.2.7. m men and n women are seated at random around a round table on 
m +n chairs. What is the probability that all the men sit next to each other? 


Exercise 3.2.8. An elevator in a building starts with 6 people and stops at 8 floors.
Assuming that all permutations of the passengers getting off at various floors are
equally likely, find the probability that at least two of them get off on the same floor.


Exercise 3.2.9. In the Massachusetts Megabucks game, a player selects 6 distinct
numbers from 1 to 42 on a ticket, and the Lottery Commission draws 6 distinct
numbers at random from 1 to 42. If all the player’s numbers match the drawn ones,
then s/he wins the jackpot and, if 5 numbers match, then a smaller prize is won. Find
the probability of each event.


Exercise 3.2.10. A random sample of size 10 is chosen from a population of 100 
without replacement. If A and B are two individuals among the 100, what is the 
probability that the sample will contain 


(a) both, 

(b) neither, 

(c) A, 

(d) either A or B, but not both? 


Simplify the answers. 


Exercise 3.2.11. Three integer digits (0, 1, ..., 9) are chosen at random with
repetitions allowed. What is the probability that


(a) exactly one digit will be even, 
(b) exactly one digit will be less than 3, 
(c) exactly two digits will be divisible by 3? 


Exercise 3.2.12. Two cards are dealt from n decks of 52 cards mixed together.
(Mixing several decks is common in the game of twenty-one in casinos.) Find
the probability of getting a pair, that is, two cards of the same denomination, for
n = 1, 2, 4, 6, 8.


Exercise 3.2.13. Compute the probability that a poker hand (five cards) dealt from a 
deck of 52 cards contains five different denominations (that is, no more than one of 
each kind: no more than one ace, one 2, etc.). 


Exercise 3.2.14. Compute the probability that a poker hand dealt from a deck of 52 
cards contains two pairs. 


Exercise 3.2.15. Compute the probability that a poker hand dealt from a deck of 52 
cards is a full house, that is, contains a pair and a triple (that is, x, x, y, y, y). 


Exercise 3.2.16. Compute the probability that in poker dice we get four of a kind. 
Exercise 3.2.17. Compute the probability that in poker dice we get a full house. 


Exercise 3.2.18. Show combinatorially that the probability in Example 3.2.4 can be
obtained by using ordered selections as

$$p(n_1; n, N_1, N_2) = \frac{\binom{n}{n_1}\, {}_{N_1}P_{n_1}\, {}_{N_2}P_{n_2}}{{}_{N}P_{n}}, \tag{3.16}$$

and show algebraically that this quantity equals the one obtained in Equation 3.7.


Exercise 3.2.19. Prove that the four inequalities $0 \le n_1 \le n$, $n_1 \le N_1$ and
$n - n_1 \le N_2$, together, are equivalent to the double inequality
$\max(0, n - N_2) \le n_1 \le \min(n, N_1)$.


Exercise 3.2.20. A 13-card hand is dealt from a standard deck of 52 cards. What is 
the probability that 


(a) it contains exactly 3 spades and all four Aces, 
(b) at least 3 of each suit? 


3.3 Independence 


The calculation of certain probabilities is greatly facilitated by the knowledge of any 
relationships, or lack thereof, between the events under consideration. In this section 
we want to examine the latter case, that is, the case in which the occurrence of one 
event has no influence on the probability of the other’s occurrence. We want to call 
such events independent of each other, and want to see how this is reflected in the 
probabilities. We begin with two examples. 


Example 3.3.1 (Repeated Tosses of Two Coins). Suppose we toss two coins repeatedly.
We describe this experiment by the sample space S = {HH, HT, TH, TT}, and
want to estimate the relative frequency of HH. Of course, we know that it should be
about 1/4, but we want to look at this in a novel way. We can argue that the first coin
shows H in about 1/2 of the trials, and since the outcome of the first coin’s toss does
not influence that of the second, the second coin shows H in not only about 1/2 of
all trials, but also among those in which the first coin turned up H. Thus, HH occurs
in about 1/2 of 1/2, that is, in about 1/4 of the trials. So P(HH) = P(the first coin
shows up H) · P(the second coin shows up H), that is, the probability of both events
occurring equals the product of the probabilities of the separate events. If we denote
the event {the first coin shows H} = {HH, HT} by A, and the event {the second
coin shows H} = {HH, TH} by B, then {HH} = A ∩ B, and the above result can be
written as P(AB) = P(A) · P(B).


Example 3.3.2 (Two Dice). We throw two dice, a black and a white one. The
probability of neither of them showing a six is $5^2/6^2$, which can be written as
(5/6) · (5/6). Now P(b ≠ 6) = 5/6, P(w ≠ 6) = 5/6, and so P(b ≠ 6 and w ≠ 6) =
P(b ≠ 6) · P(w ≠ 6).

Again, the probability of both one event and the other occurring equals the product
of the probabilities of the two events. This relation is illustrated by the diagram
of Figure 3.1 in which one shading represents {b ≠ 6}, the other {w ≠ 6}, and the
doubly shaded 5 × 5 square represents {b ≠ 6 and w ≠ 6} = {b ≠ 6} ∩ {w ≠ 6}. If
we consider the length of each side of the big square to be one unit, then the length of
the segment corresponding to b ≠ 6 is 5/6, which is also the area of the corresponding
vertical strip of 5 · 6 = 30 small squares. Thus, both this length and area have
the same measure as the probability of {b ≠ 6}. (The length can be thought of as
representing the probability of {b ≠ 6} in the 6-point sample space for b alone, and
the area as representing P(b ≠ 6) in the 36-point sample space for b and w together.)
Similarly P(w ≠ 6) shows up as a vertical length of 5/6 units, and also as the area
of the corresponding horizontal 6 × 5 strip. P(b ≠ 6 and w ≠ 6) shows up only as
an area, namely that of the corresponding 5 × 5 square.

Fig. 3.1.


From these examples, we abstract the following definition: 


Definition 3.3.1 (Independence of Two Events). Two events A and B are said to be
(statistically) independent³ if


P(AB) = P(A) - P(B). (3.17) 


The main use of this definition is in the assignment of probabilities to the joint 
occurrence of pairs of events that we know are independent in the every-day sense 
of the word. Using this definition, we make them statistically independent, too, as in 
the following example. 


Example 3.3.3 (Distribution of Voters). Assume that the distribution of voters in a 
certain city is as described in the two tables below. 


Party affiliation: | Republican | Democrat | Independent
% of all voters:   | 25         | 40       | 35


Age group:         | Under 30   | 30 to 50 | Over 50
% of all voters:   | 30         | 40       | 30


3 Note that, terminology notwithstanding, it is the events with their probabilities that are
defined to be independent here, not just the events themselves.

The probability of a randomly picked voter belonging to a given group is the
decimal fraction corresponding to the group’s percentage in the table, and the two
tables each describe a probability distribution on the sample spaces $S_1$ = {Republican,
Democrat, Independent} and $S_2$ = {Under 30, 30 to 50, Over 50}, respectively.

Assuming that party affiliation is independent of age, we can find each of the 
nine probabilities of a randomly picked voter belonging to a given possible classifi- 
cation according to party and age. These probabilities can be obtained according to 
Definition 3.3.1 by multiplying the probabilities (that were given as percentages) of 
the previous tables. The products are listed in the next table, describing a probability
distribution on the sample space $S = S_1 \times S_2$.


Age \ Party | Republican | Democrat | Independent | Any affiliation
Under 30    | 0.075      | 0.12     | 0.105       | 0.30
30 to 50    | 0.10       | 0.16     | 0.14        | 0.40
Over 50     | 0.075      | 0.12     | 0.105       | 0.30
Any age     | 0.25       | 0.40     | 0.35        | 1


The probabilities in this table are called the joint probabilities of party affiliation
and age group, and the probabilities given in the first two tables are called the
marginal probabilities of the two-way classification, because they are equal to the
probabilities in the margins of the last table. For instance, P(any age ∩ Republican)
= 0.25 in the nine-element sample space $S = S_1 \times S_2$, equals P(Republican) = 0.25
in the three-element sample space $S_1$. Notice that the marginal probabilities are the
row and column sums of the joint probabilities of the nine elementary events, and all
add up to 1, of course.
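
The construction of the joint table from the two marginal tables can be sketched in a
few lines. The dictionary layout below is an assumption made for this illustration; the
numbers are those of Example 3.3.3, held as exact fractions.

```python
from fractions import Fraction

# Marginal distributions from the two tables, as exact fractions of 100.
party = {"Republican": Fraction(25, 100), "Democrat": Fraction(40, 100),
         "Independent": Fraction(35, 100)}
age = {"Under 30": Fraction(30, 100), "30 to 50": Fraction(40, 100),
       "Over 50": Fraction(30, 100)}

# Under independence (Definition 3.3.1), each joint probability is a product.
joint = {(a, p): age[a] * party[p] for a in age for p in party}

# Recover the marginals as row and column sums of the joint table.
row_sums = {a: sum(joint[a, p] for p in party) for a in age}
col_sums = {p: sum(joint[a, p] for a in age) for p in party}

print(float(joint["Under 30", "Republican"]))
```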


The notion of independence can easily be extended to more than two events: 


Definition 3.3.2 (Independence of Several Events). Let $A_1, A_2, \ldots$ be any events.
We say that they are independent (of each other), if for all possible sets of two or
more of them, the probability of the intersection of the events in the set equals the
product of the probabilities of the individual events in the set, that is,

$$P(A_1 \cap A_2) = P(A_1)P(A_2), \quad P(A_1 \cap A_3) = P(A_1)P(A_3), \quad \ldots$$
$$P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2)P(A_3), \quad \ldots$$


Note that it is not enough to require the product formula just for the intersections 
of all pairs of events or just for the intersection of all the events under considera- 
tion, but we must require it for the intersections of all possible combinations. (See 
Exercises 3.3.3 and 3.3.4.) 

A frequent misconception is to think that independence is a property of individual
events. No: it is a relation among the members of a set of at least two events.

We can use this definition to derive the formula of the binomial distribution anew
in a very general setting:


Example 3.3.4 (Binomial Distribution). Consider an experiment that consists of n 
trials. In each trial we have 


1. two possible outcomes, which we call success and failure, 

2. the trials are independent of each other and, 

3. the probability of success is the same number p in each trial, while the probability
of failure is q = 1 − p.


Such trials are called Bernoulli trials.⁴ We ask for the probability b(k; n, p) of
obtaining exactly k successes in the n trials. Now, by the assumed independence, the
probability of having k successes and n − k failures in any fixed order is $p^k q^{n-k}$,
and since the k successes and n − k failures can be ordered in $\binom{n}{k}$ mutually
exclusive ways,

$$b(k; n, p) = \binom{n}{k} p^k q^{n-k}. \tag{3.18}$$


Thus we have obtained the same binomial distribution as in Example 3.2.5, but 
in a more general setting. 

The great importance of this distribution stems from the many possible applica- 
tions of its scheme. Success and failure can mean Head or Tail in coin tossing, win- 
ning or losing in any game, curing a patient or not in a medical experiment, people 
answering yes or no to some question in a poll, people with life insurance surviving 
or dying, etc. 


Example 3.3.5 (de Méré’s Paradox). In the seventeenth century a French nobleman, 
the Chevalier de Méré, posed the following question to the famous mathematician 
Blaise Pascal: If you throw a die four times, he said, gamblers know from experience 
that the probability of obtaining at least one six is a little more than 1/2, and if you 
throw two dice twenty-four times, the probability of getting at least one double-six 
is a little less than 1/2. How is it possible that you do not get the same probability 
in both cases, in view of the fact that P(double-six for a pair of dice) = 1/36 = 
(1/6)-P(a six for a single die), but you compensate for the factor of 1/6 by throwing 
not 4 but 6 - 4 = 24 times when using two dice? 

Well, the facts do not lie, and so there must be a mistake in the argument. Indeed,
there is one in the last step: If we multiply the number of throws by 6, the probability
of getting at least one double-six is not 6 times what it is in 4 throws or 24 times
what it is in one throw.⁵


4 Named after one of the founders of the theory of probability, Jacob Bernoulli (1654-1705), 
the most prominent member of a Swiss family of at least six famous mathematicians. 

5 Note, however, that such a multiplication rule does hold for expected values. In this case,
the expected number of double-sixes in n throws is n times the expected number in one
throw, as we shall see in Section 5.1.

Applying de Méré’s argument to throws of a single die, we can see at once that
such multiplication must be wrong: If we throw one die six times, then his reasoning
would give for the probability of getting at least one six 6 · (1/6) = 1, and if we
throw seven times, the probability of at least one six would be 7 · (1/6) > 1; clearly
impossible. The source of the error lies in the inappropriate use of the additivity
axiom, since the events $A_i$ = “the ith throw yields six,” for i = 1, 2, 3, 4, are not
mutually exclusive, and so P(at least one six in four throws of a single die) =
$P(A_1 \cup A_2 \cup A_3 \cup A_4)$ is not equal to
$P(A_1) + P(A_2) + P(A_3) + P(A_4) = 4 \cdot (1/6)$.

Similarly, the events $B_i$ = “the ith throw yields a double-six for a pair of dice,”
for i = 1, 2, ..., 24, are not mutually exclusive, and so P(at least one double-six in
24 throws of a pair of dice) = $P(B_1 \cup B_2 \cup \cdots \cup B_{24}) \ne 24 \cdot (1/36) = 2/3$.

We could write correct formulas for $P(A_1 \cup A_2 \cup A_3 \cup A_4)$ and
$P(B_1 \cup B_2 \cup \cdots \cup B_{24})$ along the lines of Theorem 3.1.3, but it is easier to
compute the required probabilities by complementation: P(at least one six in four throws
of a single die) = 1 − P(no six in four throws of a single die) = $1 - (5/6)^4 \approx 0.5177$.
Similarly, P(at least one double-six in twenty-four throws of a pair of dice) =
1 − P(no double-six in twenty-four throws of a pair of dice) = $1 - (35/36)^{24} \approx 0.4914$.
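
The two probabilities in de Méré’s question follow from the same complement formula.
The sketch below uses a hypothetical helper p_at_least_one and floating-point
arithmetic, which is accurate enough for four decimal places.

```python
def p_at_least_one(p_single, throws):
    """P(at least one success in `throws` independent trials)."""
    return 1 - (1 - p_single) ** throws


p_six = p_at_least_one(1 / 6, 4)            # one die, four throws
p_double_six = p_at_least_one(1 / 36, 24)   # two dice, twenty-four throws

print(round(p_six, 4), round(p_double_six, 4))
```

The two results fall on opposite sides of 1/2, which is the difference the gamblers
had noticed from experience.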


In closing this section, let us mention that the marginal probabilities do not de- 
termine the joint probabilities without some assumption like independence, that is, it 
is possible to have different joint probabilities with the same marginals. For instance, 
the joint probability distribution in the following example has the same marginals as 
the one in Example 3.3.3. 


Example 3.3.6 (Another Distribution of Voters). Let the joint distribution of voters in 
a certain city be described by the table below. 


Age \ Party | Republican | Democrat | Independent | Any affiliation
Under 30    | 0.05       | 0.095    | 0.155       | 0.30
30 to 50    | 0.075      | 0.21     | 0.115       | 0.40
Over 50     | 0.125      | 0.095    | 0.08        | 0.30
Any age     | 0.25       | 0.40     | 0.35        | 1


It is easy to check that the various ages and party affiliations are not independent
of each other. For instance, P(Under 30)P(Republican) = 0.30 · 0.25 = 0.075, while
P(Under 30 and Republican) = 0.05.
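
This check can be carried out mechanically for every cell of the table. In the sketch
below the joint probabilities of Example 3.3.6 are entered in thousandths so that the
arithmetic is exact; the variable names are assumptions made for the illustration.

```python
from fractions import Fraction as F

parties = ["Republican", "Democrat", "Independent"]
ages = ["Under 30", "30 to 50", "Over 50"]

# Joint probabilities from the table of Example 3.3.6, in thousandths.
rows = [[50, 95, 155], [75, 210, 115], [125, 95, 80]]
joint = {(a, p): F(rows[i][j], 1000)
         for i, a in enumerate(ages) for j, p in enumerate(parties)}

# Marginals as row and column sums.
p_age = {a: sum(joint[a, p] for p in parties) for a in ages}
p_party = {p: sum(joint[a, p] for a in ages) for p in parties}

# Cells where the product rule of Definition 3.3.1 fails.
dependent_cells = [(a, p) for a in ages for p in parties
                   if joint[a, p] != p_age[a] * p_party[p]]

print(len(dependent_cells))
```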


Exercises 
Exercise 3.3.1. Three dice are thrown. Show that the events A = {the first die shows
an even number} and B = {the sum of the numbers on the second and third dice is
even} are independent.


Exercise 3.3.2. If b and w stand for the results of a throw of two dice, show that
the events A = {b + w < 8} and B = {b = 3 or 4} are statistically independent
(although it is difficult to see why they should be in the usual sense of the word).


Exercise 3.3.3. Toss two dice. Let A = {b < 4}, B = {b = 3, 4, or 5} and C =
{b + w = 9}. Show that these events are not independent pairwise, but
P(A ∩ B ∩ C) = P(A)P(B)P(C).


Exercise 3.3.4. Toss two coins. Let A = {HH, HT}, B = {TH, HH} and C =
{HT, TH}. Show that these events are independent pairwise, but
P(A ∩ B ∩ C) ≠ P(A)P(B)P(C).


Exercise 3.3.5. Let A and B be independent events. Show that 


(a) A and $\bar{B}$ are also independent, and so are
(b) $\bar{A}$ and $\bar{B}$.


Exercise 3.3.6. (a) Can two independent events with nonzero probabilities be mutu- 
ally exclusive? 

(b) Can two mutually exclusive events with nonzero probabilities be independent? 
(Prove your answers.) 


Exercise 3.3.7. A coin is tossed five times. Find the probabilities of obtaining exactly
0, 1, 2, 3, 4, and 5 heads, and plot them in a coordinate system.


Exercise 3.3.8. A die is thrown six times. Find the probabilities of obtaining 


(a) exactly 4 sixes, 
(b) exactly 5 sixes, 
(c) exactly 6 sixes, 
(d) at least 4 sixes, 
(e) at most 3 sixes. 


Exercise 3.3.9. An urn contains 5 red, 5 white, and 5 blue balls. We draw six balls 
independently, one after the other, with replacement. What is the probability of ob- 
taining 2 of each color? 


Exercise 3.3.10. Let A, B, and C be independent events for which A ∪ B ∪ C = S.
What are the possible values of P(A), P(B), and P(C)?


Exercise 3.3.11. Let A, B, and C be pairwise independent events and A be independent
of B ∪ C. Prove that A, B, and C are totally independent.


Exercise 3.3.12. Let A, B, and C be pairwise independent events and A be independent
of BC. Prove that A, B, and C are totally independent.


3.4 Conditional Probabilities


In this section, we discuss probabilities if certain events are known to have occurred. 
We start by considering two examples. 


Example 3.4.1 (Relative Frequencies in Repeated Tossings of Two Coins). Suppose
we toss two coins n = 10 times, and observe the following outcomes: HT, TT, HT,
HH, TT, HH, HH, HT, TH, HT.

If we denote the event that the first coin shows H by A, and the event that the
second coin shows H by B, then A occurs $n_A = 7$ times, B occurs $n_B = 4$ times, and
A ∩ B occurs $n_{AB} = 3$ times. The relative frequencies of these events are
$f_A = 7/10$, $f_B = 4/10$, and $f_{AB} = 3/10$.

Let us now ask the question: What is the relative frequency of A among the
outcomes in which B has occurred? Then we must relate the number $n_{AB}$ of occurrences
of A among these outcomes to the total number $n_B$ of outcomes in which B
has occurred. Thus, if we denote this relative frequency by $f_{A|B}$, then we have

$$f_{A|B} = \frac{n_{AB}}{n_B} = \frac{3}{4}. \tag{3.19}$$

We call $f_{A|B}$ the conditional relative frequency of A, given B (or, under the
condition B). It is very simply related to the old “unconditional” relative frequencies:

$$f_{A|B} = \frac{n_{AB}/n}{n_B/n} = \frac{3/10}{4/10} = \frac{f_{AB}}{f_B}. \tag{3.20}$$

According to this example, we would want to define conditional probabilities
in an analogous manner by P(A|B) = P(AB)/P(B), for any events A and B with
P(B) ≠ 0. Indeed, this is what we shall do, but let us see another example first.
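
The counts and relative frequencies of Example 3.4.1 can be recomputed directly from
the ten recorded outcomes. This is a short illustrative sketch; the variable names
mirror the symbols of the example but are otherwise arbitrary.

```python
from fractions import Fraction

outcomes = ["HT", "TT", "HT", "HH", "TT", "HH", "HH", "HT", "TH", "HT"]
n = len(outcomes)

n_A = sum(1 for o in outcomes if o[0] == "H")    # first coin shows H
n_B = sum(1 for o in outcomes if o[1] == "H")    # second coin shows H
n_AB = sum(1 for o in outcomes if o == "HH")     # both show H

f_A, f_B, f_AB = Fraction(n_A, n), Fraction(n_B, n), Fraction(n_AB, n)
f_A_given_B = Fraction(n_AB, n_B)                # Equation 3.19

print(f_A_given_B, f_AB / f_B)
```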


Example 3.4.2 (Conditional Probabilities for Randomly Picked Points). Assume that
we pick a point at random from those shown in Figure 3.2. If P(A), P(B) and P(AB)
denote the probabilities of picking the point from A, B and AB, respectively, then
P(A) = 5/10, P(B) = 4/10, P(AB) = 3/10. If we restrict our attention to only those
trials in which B has occurred, that is, if we know that the point has been picked from
B, then obviously we want to define the conditional probability P(A|B) of A given
B as 3/4, that is, as P(A|B) = P(AB)/P(B) again.

[Fig. 3.2: the sets A and B]


These examples lead us to 


Definition 3.4.1 (Conditional Probability). Let A and B be arbitrary events in a
given sample space, with P(B) ≠ 0. Then we define the conditional probability of
A, given B, as

$$P(A|B) = \frac{P(AB)}{P(B)}. \tag{3.21}$$

Notice that actually every probability may be regarded as a conditional probability,
with the condition S, since

$$P(A|S) = \frac{P(AS)}{P(S)} = \frac{P(A)}{P(S)} = P(A). \tag{3.22}$$

Conversely, every conditional probability P(A|B) may be regarded as an unconditional
probability in a new, reduced sample space, namely in B, in place of S.
(This fact is clearly true in sample spaces with equally likely outcomes, as in
Example 3.4.2 but, in general, it needs to be proved from Definition 3.4.1. It will be the
subject of Theorem 3.4.1 below.)
Let us see some further examples. 


Example 3.4.3 (Two Dice). Two dice are thrown. What is the probability that the sum 
of the numbers that come up is 2 or 3, given that at least one die shows a 1? 

Let us call these events A and B, that is, let A = {(1, 1), (1, 2), (2, 1)} and
B = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)}.
Then AB = A, and so P(AB) = 3/36, P(B) = 11/36 and, by the definition of
conditional probabilities, P(A|B) = (3/36)/(11/36) = 3/11.

We could also have obtained this result directly, as the unconditional probability 
of the three-point event A in the eleven-point sample space B. 


Warning: We must be careful not to confuse the probability P(AB) of A and B oc- 
curring jointly (or as we say, their joint probability) with the conditional probability 
P(A|B). In the above example, for instance, it would be incorrect to assume that the 
probability of the sum being 2 or 3, if one die shows a 1, is 3/36 since, under that 
condition, the 3 favorable cases must be related to a total of 11 cases, rather than to 
all 36. 


Example 3.4.4 (Sex of Children in Randomly Selected Family). From all families 
with three children, we select one family at random. What is the probability that the 
children are all boys, if we know that a) the first one is a boy, and b) at least one is 
a boy? (Assume that each child is a boy or a girl with probability 1/2, independently 
of each other.) 

The sample space is S = {bbb, bbg, bgb, bgg, gbb, gbg, ggb, ggg} with 8 
equally likely outcomes. The sample points are the possible types of families, with 
the children listed in the order of their births; for instance, bgg stands for a family in 
which the first child is a boy and the other two are girls. 

The reduced sample space for Part a) is {bbb, bbg, bgb, bgg}, and so P(all are 
boys | the first one is a boy) = 1/4. Similarly, the reduced sample space for Part 
b) is {bbb, bbg, bgb, bgg, gbb, gbg, ggb}, and so P(all are boys | at least one is a 
boy) = 1/7. 


It may seem paradoxical that the two answers are different. After all, if we know 
that one child is a boy, what does it matter whether it is the first one we know this 
about or about any one of the three? But in the first case we know more: we know not 
just that one child is a boy, but also that it is the first one who is a boy. Thus, in the first 
case the reduced sample space is smaller than in the second case, and consequently 
the denominator of the conditional probability is smaller, while the numerator is the 
same. 
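
The two reduced sample spaces of Example 3.4.4 can also be generated mechanically. The sketch below is a minimal illustration, assuming the 8 family types are equally likely; the function cond_prob and the predicate names are hypothetical helpers introduced only for this check.

```python
from fractions import Fraction
from itertools import product

# The 8 equally likely family types, children listed in birth order.
families = ["".join(f) for f in product("bg", repeat=3)]

def cond_prob(a, b, space):
    """P(A|B) computed as an unconditional probability on the reduced sample space B."""
    reduced = [s for s in space if b(s)]
    if not reduced:
        raise ValueError("the condition has probability zero")
    return Fraction(sum(1 for s in reduced if a(s)), len(reduced))

all_boys = lambda f: f == "bbb"
first_is_boy = lambda f: f[0] == "b"
at_least_one_boy = lambda f: "b" in f

print(cond_prob(all_boys, first_is_boy, families))      # part a): 1/4
print(cond_prob(all_boys, at_least_one_boy, families))  # part b): 1/7
```

The numerator is 1 in both cases; only the size of the reduced sample space (4 versus 7) changes.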


Example 3.4.5 (The Sex of a Sibling of a Randomly Selected Child). From all families 
with two children, we select one child at random. If the selected child is a boy, what 
is the probability that he comes from a family with two boys? (Assume that each 
child is a boy or a girl with probability 1/2, independently of each other.) 

The main difference between this example and the preceding one is that there we 
selected a family and here we select a child. Thus, here the sample points must be 
children, not families. We denote the child to be selected by a capital B or G, but we 
also want to indicate the type of family he or she comes from. So, denoting the other 
child by a lowercase b or g and listing the two children in the order of their births, we 
write, for instance, Bb for a boy with a younger brother, gB for a boy with an 
older sister, etc. Thus, we use the sample space S = {Bb, Bg, Gb, Gg, bB, gB, bG, gG} 
with 8 equally likely outcomes, which denote the eight different types of child that 
can be selected. The reduced sample space for which the selected child is a boy is 
{Bb, Bg, bB, gB}, and so P(both children of the family are boys | the selected child is 
a boy) = 2/4 = 1/2. 

We may also solve this problem by ignoring the birth order. Then S = {Bb, Bg, 
Gb, Gg}, where Bb stands for a boy with a brother, Bg for a boy with a sister, etc. Now 
the reduced sample space is {Bb, Bg}. Hence P(both children of the family are boys 
| the selected child is a boy) = 1/2 again. 


The definition of conditional probability is often used in the multiplicative form 
P(AB) = P(A|B)P(B) (3.23) 


for the assignment of probabilities to joint events, much as we used the definition of 
independence for that purpose. Let us show this use in some examples. 


Example 3.4.6 (Dealing two Aces). Two cards are dealt without replacement from a 
regular deck of 52 cards. Find the probability of getting two Aces. 

Letting A = {the second card is an Ace} and B = {the first card is an Ace}, we 
have P(B) = 4/52 and P(A|B) = 3/51, because, after having dealt the first card, 
there are 3 aces and a total of 51 cards left. Hence P(both cards are Aces) = P(AB) = 
P(A|B) P(B) = (4/52) · (3/51) = 1/221. 

In several of the preceding examples, we saw that conditional probabilities behave 
like unconditional probabilities on a reduced sample space. The following theorem 
shows that indeed they satisfy the three axioms of probabilities.⁶ 


Theorem 3.4.1 (For a Fixed Condition, Conditional Probabilities Satisfy the Axioms 
of Probabilities). Let B be an event with nonzero probability in a sample space 
S. The conditional probabilities under the condition B have the following properties: 


1. P(A|B) ≥ 0 for every event A, 

2. P(S|B) = 1, 

3. P(A₁ ∪ A₂ ∪ ···|B) = P(A₁|B) + P(A₂|B) + ··· for any finite or countably 
infinite number of mutually exclusive events A₁, A₂, ... . 


Proof. 1. In the definition of P(A|B) the numerator is nonnegative by Axiom 1, and 
the denominator is positive by assumption. Thus, the fraction is nonnegative. 
2. Taking A = S in the definition of P(A|B), we get 

\[ P(S|B) = \frac{P(S \cap B)}{P(B)} = \frac{P(B)}{P(B)} = 1. \tag{3.24} \]

3. We have 

\[
\begin{aligned}
P(A_1 \cup A_2 \cup \cdots | B) &= \frac{P(A_1 B \cup A_2 B \cup \cdots)}{P(B)} \\
&= \frac{P(A_1 B) + P(A_2 B) + \cdots}{P(B)} \\
&= P(A_1|B) + P(A_2|B) + \cdots,
\end{aligned} \tag{3.25}
\]

where the next to last equality followed from Axiom 3 and Definition 3.4.1.⁷ 


Corollary 3.4.1. If the events A and A₁, A₂, ... are subsets of B, then for fixed B 
the function P(·|B) is a probability measure on the reduced sample space B in place 
of S. 


The definition of conditional probabilities leads to an important test for indepen- 
dence of two events: 


6 This theorem does not quite make P(A|B) for fixed B into a probability measure on B in 
place of S though, because in Definition 3.1.2, P(A) was defined for events A ⊂ S, but in 
P(A|B) we do not need to have A ⊂ B. See the Corollary, however. 

7 Because of this theorem, some authors use the notation P_B(A) for P(A|B) to emphasize 
the fact that P_B is a probability measure on S and in P(A|B) we do not have a function of 
a conditional event A|B but a function of A. In other words, P(A|B) = (the probability of 
A) given B, and not the probability of (A given B). Conditional events have been defined 
but have not gained popularity. 

Theorem 3.4.2 (A Condition for Independence). Two events A and B, with P(B) ≠ 
0, are independent if and only if 


P(A|B) = P(A). (3.26) 
Proof. By Definition 3.3.1 two events are independent, if and only if 
P(AB) = P(A)P(B). (3.27) 


Substituting into the left-hand side of this equation from Equation 3.23, we equivalently 
have that, when P(B) ≠ 0 (the conditional probability P(A|B) is defined only 
if P(B) ≠ 0), 


P(A|B)P(B) = P(A)P(B) (3.28) 
or, by cancelling P(B), 

P(A|B) = P(A). (3.29) 

∎ 


Note that the condition in Theorem 3.4.2 is asymmetric in A and B, but if P(A) ≠ 0, then 
we could similarly prove that A and B are independent if and only if 


P(B|A) = P(B). (3.30) 
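
The test of Theorem 3.4.2 is easy to try out on the two-dice sample space. The following Python sketch is an illustration only; the events chosen (first die even, sum equal to 7 or 8) and the helper names are our own assumptions, not examples taken from the text.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of two dice.
S = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for s in S if event(s)), len(S))

def cond_prob(a, b):
    """P(A|B) = P(AB)/P(B); assumes P(B) != 0."""
    return prob(lambda s: a(s) and b(s)) / prob(b)

def independent(a, b):
    """Criterion of Theorem 3.4.2: with P(B) != 0, A and B are independent iff P(A|B) = P(A)."""
    return cond_prob(a, b) == prob(a)

first_even = lambda s: s[0] % 2 == 0
sum_is_7 = lambda s: sum(s) == 7
sum_is_8 = lambda s: sum(s) == 8

print(independent(first_even, sum_is_7))  # True:  P(A|B) = 1/2 = P(A)
print(independent(first_even, sum_is_8))  # False: P(A|B) = 3/5, but P(A) = 1/2
```

By the symmetric condition (3.30), swapping the two arguments of independent gives the same verdict whenever both events have nonzero probability.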


Exercises 


Exercise 3.4.1. Suppose the following sequence of tosses of two coins is observed: 
HH,TT,HT,TT,TH, HT,HT,HT,TH,TT,TH,HT,TT,TH, HH, TH, TT, HH, 
HT,TH. 

Let A = {the first coin shows H} and B = {the second coin shows T}. 


(a) Find the relative frequencies f_A, f_B, f_AB, f_A|B and f_B|A. 
(b) Find the corresponding probabilities P(A), P(B), P(AB), P(A|B), P(B|A). As- 
sume that the coins are fair and the tosses independent. 


Exercise 3.4.2. Two dice are thrown, with b and w denoting their outcomes. (See 
Figure 1.4 on page 12.) Find P(w < 3 and b + w = 7), P(w < 3 | b + w = 7) and 
P(b + w = 7 | w < 3). 


Exercise 3.4.3. A card is drawn at random from a deck of 52 cards. What is the 
probability that it is a King or a 2, given that it is a face card (J, Q, K)? 


Exercise 3.4.4. In Example 3.3.6 voters of a certain district are classified according 
to age and party registration (for example, the .05 in the under 30 and Republican 
category means that 5% of the total is under 30 and Republican, that is, P({under 
30} ∩ {Republican}) = 0.05 for a randomly selected voter). Find the probabilities of 
a voter being 

(a) Republican, 

(b) under 30, 

(c) Republican if under 30, 
(d) under 30 if Republican, 
(e) Democrat, 

(f) Democrat if under 30, 
(g) Independent, 

(h) Independent if under 30. 


Exercise 3.4.5. In the previous problem, the sum of the answers to (c), (f), and (h) 
should be 1. Why? 


Exercise 3.4.6. Consider two events A and B with P(A) = 8/10 and P(B) = 9/10. 
Prove that P(A|B) ≥ 7/9. 


Exercise 3.4.7. From a family of three children, a child is selected at random and 
is found to be a girl. What is the probability that she came from a family with two 
girls and one boy? (Assume that each child is a boy or a girl with probability 1/2, 
independently of one another.) 


Exercise 3.4.8. Three dice were rolled. What is the probability that exactly one six 
came up, if it is known that at least one six came up? 


Exercise 3.4.9. Two cards are drawn at random from a deck of 52 cards without 
replacement. What is the probability that they are both Kings, given that they are 
both face cards (J, Q, K)? 


Exercise 3.4.10. Prove that any two events A and B, with P(B) ≠ 0 and P(B̄) ≠ 0, 
are independent of each other if and only if P(A|B) = P(A|B̄). 


Exercise 3.4.11. Two cards are drawn at random from a deck of 52 cards without 
replacement. What is the probability that exactly one is a King, given that at most 
one is a King? 


Exercise 3.4.12. Two cards are drawn at random from a deck of 52 cards with re- 
placement. What is the probability that exactly one is a King, given that at most one 
is a King? 


Exercise 3.4.13. A 13-card hand is dealt from a standard deck of 52 cards. What is 
the probability that 


(a) it contains no spades if it contains exactly 5 hearts, 
(b) it contains at least one spade if it contains exactly 5 hearts? 


3.5 The Theorem of Total Probability and the Theorem of Bayes 


In many applications, we need to combine the definition of conditional probabilities 
with the additivity property, as in the following examples. 


Example 3.5.1 (Picking Balls from Urns.). Suppose we have two urns, with the first 
one containing 2 white and 6 black balls, and the second one containing 2 white and 
2 black balls. We pick an urn at random, and then pick a ball from the chosen urn at 
random. What is the probability of picking a white ball? 

Let us denote the events that we choose urn 1 by U₁ and urn 2 by U₂, and that we 
pick a white ball by W and a black ball by B. We are given the probabilities P(U₁) = 
P(U₂) = 1/2, since this is what it means that an urn is picked at random; and, given 
that urn 1 is chosen, the random choice of a ball gives us the conditional probability 
P(W|U₁) = 2/8, and similarly P(W|U₂) = 2/4. Then, by Formula 3.23, 

\[ P(W \cap U_1) = P(W|U_1)P(U_1) = \frac{2}{8} \cdot \frac{1}{2} = \frac{1}{8} \tag{3.31} \]

and 

\[ P(W \cap U_2) = P(W|U_2)P(U_2) = \frac{2}{4} \cdot \frac{1}{2} = \frac{1}{4}. \tag{3.32} \]

Now, obviously, W is the union of the disjoint events W ∩ U₁ and W ∩ U₂, and 
so by the additivity of probabilities 

\[ P(W) = P(W \cap U_1) + P(W \cap U_2) = \frac{1}{8} + \frac{1}{4} = \frac{3}{8}. \tag{3.33} \]


Note that this result is not the same as that which we would get if we were to put 
all 12 balls into one urn and pick one at random. Then we would get 4/12 = 1/3 for 
the probability of picking a white ball. 
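
The contrast between the two-stage experiment and the pooled urn can be made concrete with a few lines of Python. This is a sketch under the assumptions of Example 3.5.1 (each urn chosen with equal probability, each ball in the chosen urn equally likely); the function names p_white and p_white_pooled are hypothetical.

```python
from fractions import Fraction

# Each urn is given as (number of white balls, number of black balls).
urns = [(2, 6), (2, 2)]

def p_white(urns):
    """P(W) as the sum of P(W|U_k) P(U_k), with every urn equally likely to be chosen."""
    p_urn = Fraction(1, len(urns))
    return sum(Fraction(w, w + b) * p_urn for w, b in urns)

def p_white_pooled(urns):
    """P(W) if all balls are put into a single urn and one ball is picked at random."""
    white = sum(w for w, _ in urns)
    total = sum(w + b for w, b in urns)
    return Fraction(white, total)

print(p_white(urns))         # 3/8, as in Equation 3.33
print(p_white_pooled(urns))  # 1/3
```

The two answers agree only when the urns contain the same number of balls or the same proportion of white balls, because only then does weighting each urn equally give the same result as weighting each ball equally.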

In problems such as this one, it is generally very helpful to draw a tree diagram, 
with the given conditional probabilities on the branches, as indicated in Figure 3.3. 

The unconditional probabilities of each path from top to bottom are obtained by 
multiplying the conditional probabilities along the path. For example the probability 
of the path through U₁ and B is P(U₁ ∩ B) = (1/2) · (6/8) = 3/8, and similarly, 
P(U₂ ∩ B) = (1/2) · (2/4) = 1/4. 

Fig. 3.3. [Tree diagram: the first-stage branches lead to U₁ and U₂, each with probability 1/2.] 

The probability of obtaining a given end-result, regardless of the path, is the sum 
of the probabilities of all paths ending in that result. Thus P(B) = 3/8 + 1/4 = 5/8. 


The method just shown can be used in situations involving any number of alter- 
natives and stages, whenever the conditional probabilities are known and we want to 
find the unconditional probabilities. 


Example 3.5.2 (Dealing Three Cards). From a deck of 52 cards three are drawn with- 
out replacement. What is the probability of the event E of getting two Aces and one 
King in any order? 

If we denote the relevant outcomes by A, K and O (for “other”), then we can 
illustrate the experiment by the tree in Figure 3.4. 

The event E is the union of the three elementary events AAK, AKA, and KAA. 
The relevant conditional probabilities have been indicated on the corresponding 
paths. (The rest of the diagram is actually superfluous for answering this particular 
question.) Now 

\[ P(AAK) = \frac{4}{52} \cdot \frac{3}{51} \cdot \frac{4}{50} = \frac{2}{5525}, \tag{3.34} \]

\[ P(AKA) = \frac{4}{52} \cdot \frac{4}{51} \cdot \frac{3}{50} = \frac{2}{5525}, \tag{3.35} \]

and 

\[ P(KAA) = \frac{4}{52} \cdot \frac{4}{51} \cdot \frac{3}{50} = \frac{2}{5525}. \tag{3.36} \]

Thus 

\[ P(E) = P(AAK) + P(AKA) + P(KAA) = \frac{6}{5525} \approx 0.11\%. \tag{3.37} \]


Fig. 3.4. [Tree diagram for drawing three cards: at each stage the branches are A, K and O; 
the first-stage branch probabilities are 4/52, 4/52 and 44/52.] 

Let us explain the reasons for these calculations: P(Ace first) = 4/52, since there 
are 4 Aces and 52 cards at the beginning. P(Ace second | Ace first) = 3/51, since 
after drawing an Ace first, we are left with 3 Aces and 51 cards. Then, from the 
definition of conditional probabilities, 

\[ P(\text{Ace first and Ace second}) = P(\text{Ace first})\,P(\text{Ace second} \mid \text{Ace first}) = \frac{4}{52} \cdot \frac{3}{51}. \tag{3.38} \]

After drawing two Aces we have 4 Kings and 50 cards left, hence P(King third | 
Ace first and Ace second) = 4/50. Then, again from the definition of conditional 
probabilities, 

\[
\begin{aligned}
P(AAK) &= P(\text{Ace first and Ace second and King third}) \\
&= P(\text{Ace first and Ace second})\,P(\text{King third} \mid \text{Ace first and Ace second}) \\
&= \frac{4}{52} \cdot \frac{3}{51} \cdot \frac{4}{50} = \frac{2}{5525},
\end{aligned} \tag{3.39}
\]

which is the same as our previous value for P(AAK). Now P(AKA) and P(KAA) can 
be obtained in a similar manner and, since these are the probabilities of mutually 
exclusive events whose union is E, we obtain P(E) as their sum. 
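
The path products of Example 3.5.2 can be checked with exact rational arithmetic. The sketch below multiplies the conditional probabilities along each of the three paths and, as an independent cross-check that is our own addition rather than part of the example, counts unordered hands with binomial coefficients. The helper name path is hypothetical.

```python
from fractions import Fraction
from math import comb

def path(*branches):
    """Product of the conditional probabilities (numerator, denominator) along one tree path."""
    result = Fraction(1)
    for num, den in branches:
        result *= Fraction(num, den)
    return result

p_aak = path((4, 52), (3, 51), (4, 50))  # Ace, Ace, King
p_aka = path((4, 52), (4, 51), (3, 50))  # Ace, King, Ace
p_kaa = path((4, 52), (4, 51), (3, 50))  # King, Ace, Ace
p_e = p_aak + p_aka + p_kaa

# Cross-check: choose 2 of the 4 Aces and 1 of the 4 Kings among all 3-card hands.
p_e_by_counting = Fraction(comb(4, 2) * comb(4, 1), comb(52, 3))

print(p_aak, p_e)              # 2/5525 6/5525
print(p_e == p_e_by_counting)  # True
```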


The previous examples illustrate two general theorems: 


Theorem 3.5.1 (Joint Probability of Three Events). For any three events A, B and 
C with nonzero probabilities we have 


P(ABC) = P(A|BC)P(B|C)P(C). (3.40) 
We leave the proof to the reader. 


Theorem 3.5.2 (The Theorem of Total Probability). If B₁, B₂, ..., Bₙ are mutually 
exclusive events with nonzero probabilities, whose union is B, and A is any 
event, then 

\[ P(AB) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + \cdots + P(A|B_n)P(B_n). \tag{3.41} \]

Proof. Applying Equation 3.23 to each term on the right above, we get 

\[
\begin{aligned}
& P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + \cdots + P(A|B_n)P(B_n) \\
&\quad = P(AB_1) + P(AB_2) + \cdots + P(AB_n) = P(AB).
\end{aligned} \tag{3.42}
\]

The last sum equals P(AB), since 

\[ (AB_1) \cup (AB_2) \cup \cdots \cup (AB_n) = A(B_1 \cup B_2 \cup \cdots \cup B_n) = AB \tag{3.43} \]

and the ABᵢ’s are mutually exclusive since the Bᵢ’s are, that is, 

\[ (AB_i)(AB_j) = A(B_i B_j) = A\emptyset = \emptyset \tag{3.44} \]

for any pair Bᵢ, Bⱼ with i ≠ j. ∎ 

If B = S or B ⊃ A, then AB = A and the theorem reduces to the following 
special case: 


Corollary 3.5.1. If A is any event and B₁, B₂, ..., Bₙ are mutually exclusive events 
with nonzero probabilities, whose union is S or contains A, then 

\[ P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + \cdots + P(A|B_n)P(B_n). \tag{3.45} \]


Example 3.5.3 (Second Card in a Deal). From a well-shuffled deck of 52 cards we 
deal out two cards. What is the probability that the second card is a spade? 

We present two solutions. 

First, letting S₁ denote the event that the first card is a spade, and S₂ the event 
that the second one is a spade, Corollary 3.5.1 gives 

\[ P(S_2) = P(S_2|S_1)P(S_1) + P(S_2|\bar{S}_1)P(\bar{S}_1) = \frac{12}{51} \cdot \frac{13}{52} + \frac{13}{51} \cdot \frac{39}{52} = \frac{1}{4}. \tag{3.46} \]

On the other hand, we could have argued simply that the second card in the deck 
has just as much chance of being a spade as the first card, if we do not know whether 
the first card is a spade or not. Similarly, the probability that the nth card is a spade 
is also 1/4 for any n from 1 to 52, since we may cut the deck above the nth card, and 
start dealing from there. 


Example 3.5.4 (Suit of Cards Under Various Conditions). From a deck of cards two 
are dealt without replacement. Find the probabilities that 


(a) both are clubs, given that the first one is a club, 
(b) both are clubs, given that one is a club, 

(c) both are clubs, given that one is the Ace of clubs, 
(d) one is the Ace of clubs, given that both are clubs. 


(a) Clearly, P(both are clubs | the first one is a club) = P(second card is a club | 
the first one is a club) = 12/51 = 4/17. 

(b) In this case the possible outcomes are {CC, CC̄, C̄C, C̄C̄}, with C denoting 
a club and C̄ a non-club, and the first letter indicating the first card and the second 
letter the second card. The condition that one card is a club means that we know that 
one of the two cards is a club but the other can be anything or, in other words, that 
at least one of the two cards is a club. Thus,⁸ P(one is a club) = P(CC, CC̄, C̄C) = 
(1/4) · (12/51) + (1/4) · (39/51) + (3/4) · (13/51), and so 

\[ P(\text{both are } C \mid \text{one is } C) = P(CC \mid CC \cup C\bar{C} \cup \bar{C}C) = \frac{\frac{1}{4} \cdot \frac{12}{51}}{\frac{1}{4} \cdot \frac{12}{51} + \frac{1}{4} \cdot \frac{39}{51} + \frac{3}{4} \cdot \frac{13}{51}} = \frac{2}{15}. \tag{3.47} \]


8 We usually omit the braces or union signs around compound events when there are 
already parentheses there, and separate the components with commas. Thus we write 
P(CC, CC̄, C̄C) rather than P({CC, CC̄, C̄C}) or P(CC ∪ CC̄ ∪ C̄C). 

Another way of computing this probability is by using a reduced sample space: 
There are 13 · 51 + 39 · 13 ordered ways of dealing at least one club, because the first 
card can be a club in 13 ways and then the second card can be any one of the remaining 
51 cards, or the first card can be anything other than a club in 39 ways, but then the 
second card must be one of the 13 clubs. Also, there are 13 · 12 ways of dealing two 
clubs and {both are C} ∩ {one is C} = {both are C}. Thus, 

\[ P(\text{both are } C \mid \text{one is } C) = \frac{13 \cdot 12}{13 \cdot 51 + 39 \cdot 13} = \frac{2}{15}, \tag{3.48} \]

as before. 

It may seem surprising that the answers to parts (a) and (b) are not the same. After 
all, why should it make a difference whether we know that the first card is a club, or 
just that one of the cards is a club? The answer is that the conditions are different: 
in case (a) we computed P(CC | CC ∪ CC̄) = [(1/4) · (12/51)]/[(1/4) · (12/51) + 
(1/4) · (39/51)] = 4/17, whereas in case (b) we computed P(CC | CC ∪ CC̄ ∪ C̄C). 

(c) Again, at first glance it may seem paradoxical that it makes a difference 
whether we know that one of the cards is the Ace of clubs, or just any club but, 
as we shall see, we are talking here of a different event under a different condition. 

Computing with the reduced sample space, we have 1 · 51 + 51 · 1 ordered ways 
of dealing the Ace of clubs, because the first card can be the Ace of clubs in just 1 
way and then the second card can be any one of the remaining 51 cards, or the first 
card can be other than the Ace of clubs in 51 ways but then the second card must be 
the Ace of clubs. Similarly, there are 1 · 12 + 12 · 1 ways of dealing two clubs, one 
of which is the Ace, and so 

\[ P(\text{both are } C \mid \text{one is the } AC) = \frac{1 \cdot 12 + 12 \cdot 1}{1 \cdot 51 + 51 \cdot 1} = \frac{4}{17}. \tag{3.49} \]

(d) In this case, 

\[ P(\text{one is the } AC \mid \text{both are } C) = \frac{P(\text{one is the } AC \text{ and both are } C)}{P(\text{both are } C)} = \frac{\frac{1}{52} \cdot \frac{12}{51} + \frac{12}{52} \cdot \frac{1}{51}}{\frac{13}{52} \cdot \frac{12}{51}} = \frac{2}{13}. \tag{3.50} \]
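
All four parts of Example 3.5.4 can be verified by enumerating the 52 · 51 ordered deals of two cards. The Python sketch below is only an illustration; the two-character card encoding (rank followed by suit, so that "AC" is the Ace of clubs) and the helper names are assumptions made for this check.

```python
from fractions import Fraction
from itertools import permutations

RANKS = "A23456789TJQK"
SUITS = "CDHS"                           # C = clubs
deck = [r + s for r in RANKS for s in SUITS]
deals = list(permutations(deck, 2))      # 52 * 51 equally likely ordered deals

def cond_prob(a, b):
    """P(A|B) computed on the reduced sample space of deals satisfying b."""
    reduced = [d for d in deals if b(d)]
    return Fraction(sum(1 for d in reduced if a(d)), len(reduced))

is_club = lambda card: card[1] == "C"
both_clubs = lambda d: is_club(d[0]) and is_club(d[1])
first_is_club = lambda d: is_club(d[0])
one_is_club = lambda d: is_club(d[0]) or is_club(d[1])   # at least one club
has_ace_of_clubs = lambda d: "AC" in d

print(cond_prob(both_clubs, first_is_club))      # (a) 4/17
print(cond_prob(both_clubs, one_is_club))        # (b) 2/15
print(cond_prob(both_clubs, has_ace_of_clubs))   # (c) 4/17
print(cond_prob(has_ace_of_clubs, both_clubs))   # (d) 2/13
```

The sizes of the four reduced sample spaces (663, 1170, 102 and 156 deals) are what make the answers differ.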


Example 3.5.5 (The Gambler’s Ruin). A gambler who has m > 0 dollars, bets 1 
dollar each time on H in successive tosses of a coin, that is, he wins or loses 1 dollar 
each time, until he ends up with n dollars, for some n > m, or runs out of money. 
Find the probability of the gambler’s ruin, assuming that his opponent cannot be 
ruined. 

Let Aₘ denote the event that the gambler with initial capital m is ruined. Then, 
if he wins the first toss, he has m + 1 dollars, and the event of ruin in that case is 
denoted by Aₘ₊₁. That is, P(Aₘ|H) = P(Aₘ₊₁), where H denotes the event that 
the first toss results in a head. Similarly, P(Aₘ|T) = P(Aₘ₋₁). 

On the other hand, by Corollary 3.5.1, 

\[ P(A_m) = P(A_m|H)P(H) + P(A_m|T)P(T) \quad \text{for } 0 < m < n, \tag{3.51} \]

which can then also be written as 

\[ P(A_m) = P(A_{m+1}) \cdot \frac{1}{2} + P(A_{m-1}) \cdot \frac{1}{2} \quad \text{for } 0 < m < n. \tag{3.52} \]

If we regard P(Aₘ) as an unknown function f(m), then this type of equation 
is called a difference equation, and is known to have the general solution f(m) = 
a + bm, where a and b are arbitrary constants. (We shall deduce this fact in Section 
5.3, but we need many other facts first.) For a particular solution, these constants 
can be determined by initial or boundary conditions. In the present case, obviously 
P(A₀) = 1 and P(Aₙ) = 0. Hence a + b · 0 = 1 and a + bn = 0, which give a = 1 
and b = −1/n. Thus, the probability of the gambler’s ruin is 

\[ P(A_m) = 1 - \frac{m}{n}. \tag{3.53} \]


This formula is indeed very reasonable. It shows, for instance, that if n = 2m, 
that is, if the gambler wants to double his money, then both the probability of ruin 
and the probability of success are 1/2. Similarly, if the gambler wants to triple his 
money, that is, n = 3m, then the probability of ruin is 2/3. Generally, the greedier he 
is, the larger the probability of ruin. 
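
That the linear function 1 − m/n really solves the difference equation of the Gambler's Ruin, together with the boundary conditions P(A₀) = 1 and P(Aₙ) = 0, can be confirmed numerically for any particular n. The sketch below does this with exact fractions; it is a verification of the stated solution, not a derivation, and the function names are hypothetical.

```python
from fractions import Fraction

def ruin_probability(m, n):
    """Probability of ruin 1 - m/n for initial capital m and target n, with 0 <= m <= n."""
    if n <= 0 or not 0 <= m <= n:
        raise ValueError("need n > 0 and 0 <= m <= n")
    return 1 - Fraction(m, n)

def satisfies_difference_equation(n):
    """Check f(m) = f(m+1)/2 + f(m-1)/2 for 0 < m < n, and f(0) = 1, f(n) = 0."""
    f = [ruin_probability(m, n) for m in range(n + 1)]
    boundary_ok = f[0] == 1 and f[n] == 0
    interior_ok = all(f[m] == (f[m + 1] + f[m - 1]) / 2 for m in range(1, n))
    return boundary_ok and interior_ok

print(satisfies_difference_equation(10))  # True
print(ruin_probability(5, 10))            # doubling the money: 1/2
print(ruin_probability(5, 15))            # tripling the money: 2/3
```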


Example 3.5.6 (Laplace’s Rule of Succession). The great eighteenth century French 
mathematician Laplace used the following very interesting argument to estimate the 
chances of the sun’s rising tomorrow. 

Let sunrises be independent random events with an unknown probability p of 
occurrence. Let N be a large positive integer, Bᵢ = “the probability p is i/N,” for 
i = 0, 1, 2, ..., N, and A = “the sun has risen every day for n days,” where Laplace 
took n to be 1,826,213 days, which is 5,000 years. He assumed, since we have no 
advance knowledge of the value of p, that the possible values are equally likely and 
so P(Bᵢ) = 1/(N + 1) for each i. By the assumed independence, 

\[ P(A|B_i) = \left(\frac{i}{N}\right)^n. \tag{3.54} \]

Hence, by the theorem of total probability, 

\[ P(A) = \sum_{i=0}^{N} \frac{1}{N+1} \left(\frac{i}{N}\right)^n. \tag{3.55} \]

Similarly, if B = “the sun has risen for n days and will rise tomorrow,” then 

\[ P(B) = \sum_{i=0}^{N} \frac{1}{N+1} \left(\frac{i}{N}\right)^{n+1}. \tag{3.56} \]

Consequently, 

\[ P(B|A) = \frac{P(AB)}{P(A)} = \frac{P(B)}{P(A)}. \tag{3.57} \]

For large values of N, the sums can be simplified by noting that Σ_{i=0}^{N} (1/N)(i/N)ⁿ 
is a Riemann sum for the integral ∫₀¹ xⁿ dx = 1/(n + 1). Therefore 

\[ P(A) \approx \frac{N}{N+1} \cdot \frac{1}{n+1}, \tag{3.58} \]

and 

\[ P(B) \approx \frac{N}{N+1} \cdot \frac{1}{n+2}. \tag{3.59} \]

Thus, the probability that the sun will rise tomorrow, if it has risen every day for 
n days is 

\[ P(B|A) \approx \frac{n+1}{n+2}. \tag{3.60} \]


For n = 1,826,213 this result is indeed very close to 1. Unfortunately, how- 
ever, the argument is on shaky grounds. First, it is difficult to see sunrises as random 
events. Second, why would sunrises on different days be independent of each other? 
Third, just because we do not know a probability, we cannot assume that it has a 
random value equally likely to be any number from 0 to 1. In the eighteenth century, 
however, probability theory was in its infancy, and its foundations were murky. Set- 
ting aside the application to the sun, we can easily build a model with urns and balls 
for which the probabilities above provide an accurate description. 
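
The quality of the Riemann-sum approximation in Example 3.5.6 can be examined by computing the ratio P(B)/P(A) exactly for a finite N. In the Python sketch below, the common factors 1/(N + 1) and the powers of N are cancelled by hand, so that only integer sums remain; the function name rule_of_succession and the sample values of n and N are our own choices.

```python
from fractions import Fraction

def rule_of_succession(n, N):
    """Exact P(B|A) = P(B)/P(A) for the finite-N sums of Example 3.5.6.

    After cancelling 1/(N+1) and the powers of N, the ratio equals
    (sum of i**(n+1)) / (N * sum of i**n), with i running from 0 to N.
    """
    if n < 1 or N < 1:
        raise ValueError("need n >= 1 and N >= 1")
    numerator = sum(i ** (n + 1) for i in range(N + 1))
    denominator = N * sum(i ** n for i in range(N + 1))
    return Fraction(numerator, denominator)

n = 10
for N in (10, 100, 1000):
    exact = float(rule_of_succession(n, N))
    print(N, exact, (n + 1) / (n + 2))   # the exact value approaches (n+1)/(n+2) as N grows
```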


The next theorem is a straightforward formula based on the definition of con- 
ditional probabilities and the theorem of total probability. It is important because 
it provides a scheme for many applications. Before discussing the general formula, 
however, we start with a simple example. 


Example 3.5.7 (Which Urn Did a Ball Come From?). We consider the same experi- 
ment as in Example 3.5.1, but ask a different question: We have two urns, with the 
first one containing 2 white and 6 black balls, and the second one containing 2 white 
and 2 black balls. We pick an urn at random, and then pick a ball from the chosen 
urn at random. We observe that the ball is white and ask: What is the probability that 
it came from Urn 1, i.e., that in the first step we picked Urn 1? 

With the notation of Example 3.5.1, we are asking for the conditional probability 
P(U₁|W). This probability can be computed as follows: 

\[ P(U_1|W) = \frac{P(W|U_1)P(U_1)}{P(W|U_1)P(U_1) + P(W|U_2)P(U_2)} = \frac{\frac{2}{8} \cdot \frac{1}{2}}{\frac{2}{8} \cdot \frac{1}{2} + \frac{2}{4} \cdot \frac{1}{2}} = \frac{1}{3}. \tag{3.61} \]

The general scheme that this example illustrates is this: we have several possible 
outcomes of an experiment, like U₁ and U₂ of the first stage above, and we observe 
the occurrence of some other event like W. We then ask the question: What are the 
new probabilities of the original outcomes in light of this observation? The answer 
for the general case is given by 


Theorem 3.5.3 (Bayes’ Theorem). If A is any event and B₁, B₂, ..., Bₙ are mutually 
exclusive events with nonzero probabilities, whose union is S or contains A, 
then 

\[ P(B_i|A) = \frac{P(A|B_i)P(B_i)}{P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + \cdots + P(A|B_n)P(B_n)} \tag{3.62} \]

for i = 1, 2, ..., n. 
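
Bayes' formula translates directly into a short routine: multiply each prior by the corresponding conditional probability and divide by the sum of these products. The following Python sketch is one possible way to write it; the function name bayes and its argument names are hypothetical, and the usage lines apply it to the two urns of Example 3.5.7.

```python
from fractions import Fraction

def bayes(priors, likelihoods):
    """Return the list of P(B_i|A), given priors P(B_i) and likelihoods P(A|B_i).

    The B_i are assumed to be mutually exclusive, with a union that contains A.
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    joint = [l * p for l, p in zip(likelihoods, priors)]   # P(A|B_i) P(B_i) = P(A B_i)
    total = sum(joint)                                     # P(A), by total probability
    if total == 0:
        raise ValueError("P(A) must be nonzero")
    return [j / total for j in joint]

# Two urns chosen with probability 1/2 each; P(W|U1) = 2/8 and P(W|U2) = 2/4.
posteriors = bayes([Fraction(1, 2), Fraction(1, 2)], [Fraction(2, 8), Fraction(2, 4)])
print(posteriors[0], posteriors[1])   # 1/3 2/3
```

The posterior probabilities always add up to 1, which is a convenient sanity check on any hand computation.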


Example 3.5.8 (A Blood Test). A blood test, when given to a person with a certain 
disease, shows the presence of the disease with probability .99, and fails to show it 
with probability .01. It also produces a false positive result for healthy persons, with 
probability .02. We also know that .1% of the population has the disease. What is the 
probability that a person really has the disease if the test says so? 

We use Bayes’ theorem for a randomly selected person, with B₁ = “the person 
has the disease,” B₂ = “the person does not have the disease,” and A = “the test gives 
a positive result.” Then we are looking for P(B₁|A), and we know that P(A|B₁) = 
0.99, P(B₁) = 0.001, P(A|B₂) = 0.02, and P(B₂) = 0.999. Hence, 

\[ P(B_1|A) = \frac{0.99 \cdot 0.001}{0.99 \cdot 0.001 + 0.02 \cdot 0.999} = \frac{99}{99 + 1998} \approx 0.047. \tag{3.63} \]


Thus, the probability that a person really has the disease if the test says so turns 
out to be less than 5%. This number is unbelievably low. After all, the test is 99 or 98 
percent accurate, so how can this be true? The explanation is this: The positive test 
result can arise two ways. Either it is a true positive result, that is, the patient has the 
disease and the test shows it correctly, or it is a false positive result, that is, the test 
has mistakenly diagnosed a healthy person as diseased. Now, because the disease is 
very rare (only one person in a thousand has it), the number of healthy persons is 
relatively large, and so the 2% of them who are falsely diagnosed as diseased still far 
outnumber, 1998 to 99, the correctly diagnosed, diseased people. Thus, the fraction 
of correct positive test results to all positive ones is small. 

The moral of the example is that for a rare disease we need a much more accurate 
test. The probability of a false positive result must be of a lower order of magnitude 
than the fraction of people with the disease. On the other hand, the probability of a 
false negative result does not have to be so low; it just depends on how many diseased 
persons we can afford to miss, regardless of the rarity of the disease. 
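
The moral of Example 3.5.8 can be explored numerically by varying the false positive probability while keeping the other numbers fixed. The sketch below is illustrative only; the function name and the parameter names prevalence, sensitivity and false_positive_rate are terminology we introduce here, and the second false positive rate (1 in 10,000) is an assumed value chosen to be an order of magnitude below the prevalence.

```python
from fractions import Fraction

def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) by Bayes' theorem with two alternatives."""
    true_positive = sensitivity * prevalence                 # P(A|B1) P(B1)
    false_positive = false_positive_rate * (1 - prevalence)  # P(A|B2) P(B2)
    return true_positive / (true_positive + false_positive)

prevalence = Fraction(1, 1000)
sensitivity = Fraction(99, 100)

as_in_example = posterior_given_positive(prevalence, sensitivity, Fraction(2, 100))
more_accurate = posterior_given_positive(prevalence, sensitivity, Fraction(1, 10000))

print(as_in_example, float(as_in_example))   # 11/233, about 0.047
print(float(more_accurate))                  # above 0.9 under the assumed rate
```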


Bayes’ theorem is sometimes described as a formula for the probabilities of 
“causes.” In the above example, for instance, B₁ and B₂ may be considered the two 
possible causes of the positive test result. The probabilities P(B₁) and P(B₂) are 
called the prior probabilities of these causes, and P(B₁|A) and P(B₂|A) their posterior 
probabilities, because they represent the probabilities of B₁ and B₂ before and 
after the consideration of the occurrence of A. The terminology of “causes” is, however, 
misleading in many applications where no causal relationship exists between A 
and the Bᵢ. 

Although Bayes’ theorem is certainly true and quite useful, it has been contro- 
versial because of philosophical problems with the assignment of prior probabilities 
in some applications. 


Exercises 


Exercise 3.5.1. In an urn there are 1 white and 3 black balls, and in a second urn 3 
white and 2 black balls. One of the urns is chosen at random and then a ball is picked 
from it at random. 


(a) Illustrate the possibilities by a tree diagram. 
(b) Find the branch probabilities. 
(c) Find the probability of picking a white ball. 


Exercise 3.5.2. Given two urns with balls as in the previous problem, we choose an 
urn at random and then we pick two balls from it without replacement. 


(a) Illustrate the possibilities with a tree diagram. 
(b) Find the branch probabilities. 
(c) Find the probability of picking a white and a black ball (in any order). 


Exercise 3.5.3. From a deck of cards, two are drawn without replacement. Find the 
probabilities that 


(a) both are Aces, given that one is an Ace, 

(b) both are Aces, given that one is a red Ace, 

(c) both are Aces, given that one is the Ace of spades, 
(d) one is the Ace of spades, given that both are Aces. 


Exercise 3.5.4. Modify the Gambler’s Ruin problem as follows: Suppose there are 
two players, Alice and Bob, who bet on successive flips of a coin until one of them 
wins all the money of the other. Alice has m dollars and bets one dollar each time 
on H, while Bob has n dollars and bets one dollar each time on T. In each play the 
winner takes the dollar of the loser. Find the probability of ruin for each player. 


Exercise 3.5.5. Modify the Gambler’s Ruin problem by changing the probability of 
winning from 1/2 to p in each trial. (Hint: Modify Equation 3.52, and try to find 
constants λ such that P(Aₘ) = λᵐ for 0 < m < n. The general solution should be of 
the form P(Aₘ) = aλ₁ᵐ + bλ₂ᵐ, and the constants a and b are to be determined from 
the boundary conditions.) 


Exercise 3.5.6. In a mythical kingdom, a prisoner is given two urns and 50 black 
and 50 white marbles. The king says that the prisoner must place all the marbles 
in the urns with neither urn remaining empty, and he will return later and pick an 
urn and then a marble from it at random. If the marble is white, the prisoner will be 
released, but if it is black, he will remain in jail. How should the prisoner distribute 
the marbles? Prove that your answer indeed maximizes the prisoner’s chances of 
going free. 


Exercise 3.5.7. In an urn there are 1 white and 3 black balls, and in a second urn 3 
white and 2 black balls as in Exercise 3.5.1. One of the urns is chosen at random 
and then a ball is picked from it at random and turns out to be white. What is the 
probability that it came from Urn 1? 


Exercise 3.5.8. Given two urns with balls as in the previous problem, we choose an 
urn at random and then we pick two balls from it without replacement. (Also see 
Exercise 3.5.2.) What is the probability that the two balls came from Urn 1 if they 
have different colors? 


Exercise 3.5.9. From all families with two children, one family is selected at random 
and then a child is selected from it at random and is found to be a girl. What is the 
probability that she came from a family with two girls? (Assume that each child 
is a boy or a girl with probability 1/2, independently of one another.) Use Bayes’ 
theorem. 


Exercise 3.5.10. From all families with three children, one family is selected at ran- 
dom and then a child is selected from it at random and is found to be a girl. What 
is the probability that she came from a family with two girls and one boy? (Assume 
that each child is a boy or a girl with probability 1/2, independently of one another.) 
Use Bayes’ theorem. 


Exercise 3.5.11. Given two urns with balls as in Exercise 3.5.1, we choose a ball 
from each urn. If one ball is white and the other black, what is the probability that 
the white ball came from Urn 1? 


Exercise 3.5.12. On a multiple-choice question with five choices, a certain student 
either knows the answer and then marks the correct choice, or does not know the 
answer and then marks one of the choices at random. What is the probability that he 
knew the answer if he marked the correct choice? Assume that the prior probability 
that he knew the answer is 3/4. 


Exercise 3.5.13. Keith Devlin attributes this problem to Amos Tversky:⁹ Imagine 
you are a member of a jury judging a hit-and-run case. A taxi hit a pedestrian one 
night and fled the scene. The entire case against the taxi company rests on the evidence 
of one witness, an elderly man, who saw the accident from his window some 
distance away. He says that he saw the pedestrian struck by a blue taxi. In trying 
to establish her case, the lawyer for the injured pedestrian establishes the following 
facts: 


1. There are only two taxi companies in town, “Blue Cabs” and “Black Cabs.” On 
the night in question 85% of all taxis on the road were black and 15% were 
blue. 

2. The witness has undergone an extensive vision test under conditions similar to 
those on the night in question, and has demonstrated that he can successfully 
distinguish a blue taxi from a black taxi 80% of the time. 


If you were on the jury, how would you decide? 


9 “Tversky’s Legacy Revisited,” by Keith Devlin, www.maa.org/devlin/devlin_july.html, 
1996. 


4 


Random Variables 


4.1 Probability Functions and Distribution Functions 


In many applications, the outcomes of probabilistic experiments are numbers or have 
some numbers associated with them, which we can use to obtain important informa- 
tion, beyond what we have seen so far. We can, for instance, describe in various ways 
how large or small these numbers are likely to be and compute likely averages and 
measures of spread. For example, in 3 tosses of a coin, the number of heads obtained 
can range from 0 to 3, and there is one of these numbers associated with each possi- 
ble outcome. Informally, the quantity “number of heads” is called a random variable, 
and the numbers 0 to 3 its possible values. In general, such an association of numbers 
with each member of a set is called a function. For most functions whose domain is 
a sample space, we have a new name: 


Definition 4.1.1 (Random Variable). A random variable (abbreviated r.v.) is a real- 
valued function on a sample space. 


Random variables are usually denoted by capital letters from the end of the 
alphabet, such as X, Y, Z. Related sets like {s : X(s) = x}, {s : X(s) < x}, and 
{s : X(s) ∈ I}, for any number x and any interval I, are events¹ in S. They are 
usually abbreviated as {X = x}, {X < x}, and {X ∈ I} and have probabilities associated 
with them. The assignment of probabilities to all such events, for a given random 
variable X, is called the probability distribution of X. Furthermore, in the notation 
for such probabilities, it is customary to drop the braces, that is, to write P(X = x), 
rather than P({X = x}), etc. 

Hence, the preceding example can be formalized as: 


Example 4.1.1 (Three tosses of a coin). Let S = {HHH, HHT, HTH, HTT, THH, 
THT, TTH, TTT} describe three tosses of a coin, and let X denote the number of 
heads obtained. Then the values of X, for each outcome s in S, are given in the 
following table: 


s:    | HHH | HHT | HTH | HTT | THH | THT | TTH | TTT 
X(s): |  3  |  2  |  2  |  1  |  2  |  1  |  1  |  0 


¹ Actually, in infinite sample spaces there exist complicated functions for which not all such 
sets are events, and so we define a r.v. as not just any real-valued function X, but as a 
so-called measurable function, that is, one for which all such sets are events. We shall ignore 
this issue; it is explored in more advanced books. 


Thus, in the case of three independent tosses of a fair coin, P(X = 0) = 1/8, 
P(X = 1) = 3/8, P(X = 2) = 3/8, and P(X = 3) = 1/8. 
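
The table and the probabilities above can be reproduced by brute-force enumeration. The following is only a minimal sketch in Python (a language the text itself does not use); the names `S`, `X` and `pf` are illustrative choices, and a fair coin is assumed so that all eight outcomes are equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space S: all sequences of three tosses (equally likely for a fair coin).
S = ["".join(t) for t in product("HT", repeat=3)]

def X(s):
    """The random variable of the example: number of heads in outcome s."""
    return s.count("H")

# Count how many outcomes give each value of X, then divide by |S| = 8.
counts = Counter(X(s) for s in S)
pf = {x: Fraction(c, len(S)) for x, c in sorted(counts.items())}

for x, p in pf.items():
    print(f"P(X = {x}) = {p}")
```

Using `Fraction` keeps the probabilities exact, so they can be compared directly with the values 1/8 and 3/8 stated above.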


The following functions are generally used to describe the probability distribu- 
tion of a random variable: 


Definition 4.1.2 (Probability Function). For any probability space and any random 
variable X on it, the function f(x) = P(X = x), defined for all possible values² x 
of X, is called the probability function (abbreviated p.f.) of X. 


Definition 4.1.3 (Distribution Function). For any probability space and any random 
variable X on it, the function F(x) = P(X ≤ x), defined for all real numbers x, is 
called the distribution function (abbreviated d.f.) of X. 


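Example 4.1.2 (Three tosses of a coin, continued). Let X be the number of heads 
obtained in three independent tosses of a fair coin, as in the previous example. Then 
the p.f. of X is given by 

$$
f(x) = \begin{cases}
1/8 & \text{if } x = 0 \\
3/8 & \text{if } x = 1 \\
3/8 & \text{if } x = 2 \\
1/8 & \text{if } x = 3
\end{cases} \tag{4.1}
$$

and the d.f. of X is given by 

$$
F(x) = \begin{cases}
0 & \text{if } x < 0 \\
1/8 & \text{if } 0 \le x < 1 \\
4/8 & \text{if } 1 \le x < 2 \\
7/8 & \text{if } 2 \le x < 3 \\
1 & \text{if } x \ge 3.
\end{cases} \tag{4.2}
$$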


The graphs of these functions are shown in Figures 4.1 and 4.2 below. 

It is also customary to picture the probability function by a histogram, which is a 
bar-chart with the probabilities represented by areas. For the X above, this is shown 
in Figure 4.3. (In this case, the bars all have width one, and so their heights and areas 
are equal.) 
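
As a rough illustration of how the step function in Equation 4.2 arises from the p.f. in Equation 4.1, the sketch below sums the probabilities of all possible values not exceeding x. It is a hypothetical Python helper, not part of the text; the names `pf` and `F` are our own.

```python
from fractions import Fraction

# p.f. of the number of heads in three tosses of a fair coin (Equation 4.1).
pf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def F(x):
    """Distribution function F(x) = P(X <= x), defined for every real x."""
    return sum((p for value, p in pf.items() if value <= x), Fraction(0))

# F is constant between the possible values and jumps at 0, 1, 2 and 3.
for x in (-1, 0, 0.5, 1, 2.9, 3, 10):
    print(f"F({x}) = {F(x)}")
```

Note that `F` accepts any real argument, including values such as 0.5 that X never takes, which mirrors the fact that the d.f. is defined on all of R.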


² Sometimes f(x) is considered to be a function on all of R, with f(x) = 0 if x is not a 
possible value of X. This is a minor distinction, and it should be clear from the context 
which definition is meant. If 0 values are allowed for f, then the set {x : f(x) > 0} is 
called the support of f. 

Fig. 4.1. Graph of the p.f. f of a binomial random variable with parameters n = 3 and 
p = 1/2. 

Fig. 4.2. Graph of the d.f. F of a binomial random variable with parameters n = 3 and 
p = 1/2. 

Fig. 4.3. Histogram of the p.f. of a binomial random variable with parameters n = 3 and 
p = 1/2. 




Certain frequently occurring random variables and their distributions have spe- 
cial names. Two of these are generalizations of the number of heads in the above 
example. The first one is for a single toss, but with a not necessarily fair coin, and 
the second one for an arbitrary number of tosses. 


Definition 4.1.4 (Bernoulli Random Variables). A random variable X is called a 
Bernoulli random variable with parameter p, if it has two possible values, 0 and 1, 
with P(X = 1) = p and P(X = 0) = 1 − p = q, where p is any number from 
the interval [0, 1]. An experiment whose outcome is a Bernoulli random variable is 
called a Bernoulli trial. 


Definition 4.1.5 (Binomial Random Variables). A random variable X is called a 
binomial random variable with parameters n and p, if it has the binomial distribution 
(see Example 3.3.4) with probability function 

$$
f(x) = \binom{n}{x} p^x q^{n-x} \quad \text{for } x = 0, 1, \ldots, n, \tag{4.3}
$$

where q = 1 − p. 

The distribution function of a binomial random variable is given by 

$$
F(x) = \begin{cases}
0 & \text{if } x < 0 \\
\displaystyle\sum_{k=0}^{\lfloor x \rfloor} \binom{n}{k} p^k q^{n-k} & \text{if } 0 \le x < n \\
1 & \text{if } x \ge n.
\end{cases} \tag{4.4}
$$

Here ⌊x⌋ denotes the floor or greatest integer function, that is, ⌊x⌋ = the greatest 
integer ≤ x. 
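
Equations 4.3 and 4.4 translate almost directly into code. The following Python sketch is only an illustration under our own naming (`binomial_pf`, `binomial_df`); it uses the standard-library functions `math.comb` and `math.floor` and ordinary floating-point arithmetic.

```python
from math import comb, floor

def binomial_pf(x, n, p):
    """Equation 4.3: f(x) = C(n, x) p^x q^(n-x) for x = 0, 1, ..., n; 0 otherwise."""
    if not (isinstance(x, int) and 0 <= x <= n):
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binomial_df(x, n, p):
    """Equation 4.4: F(x) = sum of f(k) for k = 0, ..., floor(x)."""
    if x < 0:
        return 0.0
    if x >= n:
        return 1.0
    return sum(binomial_pf(k, n, p) for k in range(floor(x) + 1))

# The coin-tossing example: n = 3, p = 1/2.
print([binomial_pf(x, 3, 0.5) for x in range(4)])
print(binomial_df(1.5, 3, 0.5))
```

With n = 3 and p = 1/2 these functions should reproduce the values of Equations 4.1 and 4.2.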


Example 4.1.3 (Sum of two dice). Let us again consider the tossing of two dice, with 
36 equiprobable elementary events, and let X be the sum of the points obtained. Then 
f(x) and F(x) are given by the following tables. (Count the appropriate squares in 
Figure 1.4 on p. 12.) 


x:    |  2   |  3   |  4   |  5   |  6   |  7   |  8   |  9   |  10  |  11  |  12 
f(x): | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36 


x ∈   | (−∞, 2) | [2, 3) | [3, 4) | [4, 5) | [5, 6) | [6, 7) 
F(x): |    0    |  1/36  |  3/36  |  6/36  | 10/36  | 15/36 


x ∈   | [7, 8) | [8, 9) | [9, 10) | [10, 11) | [11, 12) | [12, ∞) 
F(x): | 21/36  | 26/36  |  30/36  |  33/36   |  35/36   |    1 
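
The two tables can be checked by enumerating the 36 equally likely outcomes. The sketch below is a hypothetical Python illustration (the names `outcomes`, `f` and `F` are ours); probabilities are printed as multiples of 1/36 to match the tables.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equiprobable outcomes (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

# p.f. of the sum: count the outcomes giving each sum and divide by 36.
counts = Counter(a + b for a, b in outcomes)
f = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

def F(x):
    """d.f. of the sum: F(x) = P(X <= x) for any real x."""
    return sum((p for v, p in f.items() if v <= x), Fraction(0))

for x in f:
    print(f"f({x}) = {f[x] * 36}/36   F({x}) = {F(x) * 36}/36")
```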

The histogram of f(x) and the graph of F(x) are shown in Figures 4.4 and 4.5. 


Fig. 4.4. Histogram of the p.f. of the sum thrown with two dice. The y-scale shows multiples 
of 1/36. 

Fig. 4.5. Graph of the d.f. of the sum thrown with two dice. 


A random variable is said to be discrete if it has only a finite or a countably 
infinite number of possible values. The random variables we have seen so far are 
discrete. In the next section, we shall discuss the most important class of nondiscrete 
random variables: continuous ones. 

Another important type of discrete variable is named in the following definition: 

Definition 4.1.6 (Discrete Uniform Random Variables). A random variable X is 
called discrete uniform if it has a finite number of possible values, say x_1, x_2, ..., x_n, 
and P(X = x_i) = 1/n for all i. 


Random variables with a countably infinite number of possible values occur in 
many applications, as in the next example. 


Example 4.1.4 (Throwing a die until a six comes up). Suppose we throw a fair die 
repeatedly, with the throws independent of each other, until a six comes up. Let X 
be the number of throws. Clearly, X can take on any positive integer value since it is 
possible (though unlikely) that we do not get a six in 100 throws, or 1000 throws, or 
in any large number of throws. 

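The probability function of X can be computed easily as follows: 

$$
\begin{aligned}
f(1) &= P(X = 1) = P(\text{six on the first throw}) = \frac{1}{6}, \\
f(2) &= P(X = 2) = P(\text{non-six on the first throw and six on the second}) = \frac{5}{6} \cdot \frac{1}{6}, \\
f(3) &= P(X = 3) = P(\text{non-six on the first two throws and six on the third}) = \left(\frac{5}{6}\right)^2 \frac{1}{6},
\end{aligned}
$$

and so on. Thus 

$$
f(k) = P(X = k) = \left(\frac{5}{6}\right)^{k-1} \frac{1}{6} \quad \text{for } k = 1, 2, \ldots. \tag{4.5}
$$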


The above example is a special case of another named family of random vari- 
ables: 


Definition 4.1.7 (Geometric Random Variables). Suppose we perform indepen- 
dent Bernoulli trials with parameter p, with 0 < p < 1, until we obtain a success. 
The number X of trials is called a geometric random variable with parameter p. It 
has the probability function 


$$
f(k) = P(X = k) = p\,q^{k-1} \quad \text{for } k = 1, 2, \ldots. \tag{4.6}
$$


The name “geometric” comes from the fact that the f(k) values are the terms 
of a geometric series. Using the formula for the sum of a geometric series, we can 
confirm that they form a probability distribution: 


$$
\sum_{k=1}^{\infty} f(k) = \sum_{k=1}^{\infty} p\,q^{k-1} = \frac{p}{1-q} = 1. \tag{4.7}
$$
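
A quick numerical check of this sum can be sketched as follows. This is only an illustration in Python with our own function name `geometric_pf`; it uses the die-throwing value p = 1/6 and compares each partial sum with the closed form 1 − q^n of a finite geometric sum.

```python
def geometric_pf(k, p):
    """Equation 4.6: f(k) = p * q**(k - 1) for k = 1, 2, ..., with q = 1 - p."""
    return p * (1 - p) ** (k - 1)

p = 1 / 6  # waiting for a six with a fair die
for n in (1, 10, 50, 200):
    partial = sum(geometric_pf(k, p) for k in range(1, n + 1))
    # The partial sum of the first n terms equals 1 - q**n, which tends to 1.
    print(n, partial, 1 - (1 - p) ** n)
```

The partial sums approach 1 quickly, in line with Equation 4.7.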


From the preceding examples we can glean some general observations about the 
probability and distribution functions of discrete random variables: 

If x_1, x_2, ... are the possible values of a discrete random variable X, then 
f(x_i) ≥ 0 for all these values and f(x) = 0 otherwise. Furthermore, Σ_i f(x_i) = 1, 
because this sum equals the probability that X takes on any of its possible values, 
which is certain. Hence the total area of all the bars in the histogram of f(x) is 
1. Also, we can easily read off the histogram the probability of X falling in any 
given interval I, as the total area of those bars that cover the x_i values in I. For 
instance, for the X of Example 4.1.3, P(3 < X ≤ 6) = P(X = 4) + P(X = 5) + 
P(X = 6) = 3/36 + 4/36 + 5/36 = 1/3, which is the total area of the bars over 4, 
5 and 6. 

The above observations, when applied to infinite intervals of the type (−∞, x], 
lead to the equation F(x) = P(X ∈ (−∞, x]) = Σ_{x_i ≤ x} P(X = x_i) = sum of the 
areas of the bars over each x_i ≤ x, and to the following properties of the distribution 
function: 


Theorem 4.1.1 (Properties of Distribution Functions). The distribution function 
F of any random variable X has the following properties: 


1. F(−∞) = lim_{x→−∞} F(x) = 0, since as x → −∞, the interval (−∞, x] → ∅. 
2. F(∞) = lim_{x→∞} F(x) = 1, since as x → ∞, the interval (−∞, x] → R. 
3. F is a nondecreasing function, since if x < y, then 

$$
\begin{aligned}
F(y) = P(X \in (-\infty, y]) &= P(X \in (-\infty, x]) + P(X \in (x, y]) \\
&= F(x) + P(X \in (x, y]),
\end{aligned} \tag{4.8}
$$

and so, F(y) being the sum of F(x) and a nonnegative term, we have F(y) ≥ 
F(x). 
4. F is continuous from the right at every point x. 


These four properties of F hold not just for discrete random variables but for all 
types. Their proofs are outlined in Exercise 4.1.13 and those following it. Also, in 
more advanced courses it is proved that any function with these four properties is the 
distribution function of some random variable. 

While the distribution function can be used for any random variable, the proba- 
bility function is useful only for discrete ones. To describe continuous random vari- 
ables, we need another function, the so-called density function, instead, as will be 
seen in the next section. 

The next theorem shows that the distribution function of a random variable X 
completely determines the distribution of X, that is, the probabilities P(X ∈ I) for 
all intervals I. 


Theorem 4.1.2 (Probabilities of a Random Variable Falling in Various Inter- 
vals). For any random variable X and any real numbers x and y, 


1. P(X ∈ (x, y]) = F(y) − F(x), 

2. P(X ∈ (x, y)) = lim_{t→y−} F(t) − F(x), 

3. P(X ∈ [x, y]) = F(y) − lim_{t→x−} F(t), 

4. P(X ∈ [x, y)) = lim_{t→y−} F(t) − lim_{t→x−} F(t). 
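
For a discrete random variable the left limit lim_{t→x−} F(t) is simply P(X < x), so the four formulas are easy to try out. The sketch below is a hypothetical Python illustration using the sum of two dice from Example 4.1.3; the names `F`, `F_left` and `prob_interval` are ours, and the compact formula used for the p.f. is just a restatement of the table in that example.

```python
from fractions import Fraction

# p.f. of the sum of two dice: f(x) = (6 - |x - 7|)/36 for x = 2, ..., 12.
f = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}

def F(x):
    """F(x) = P(X <= x)."""
    return sum((p for v, p in f.items() if v <= x), Fraction(0))

def F_left(x):
    """Left limit of F at x, which equals P(X < x) for a discrete X."""
    return sum((p for v, p in f.items() if v < x), Fraction(0))

def prob_interval(x, y, closed_left, closed_right):
    """P(X in the interval from x to y), using only F and its left limits."""
    upper = F(y) if closed_right else F_left(y)
    lower = F_left(x) if closed_left else F(x)
    return upper - lower

print(prob_interval(3, 6, False, True))   # (3, 6]
print(prob_interval(3, 6, False, False))  # (3, 6)
print(prob_interval(3, 6, True, True))    # [3, 6]
print(prob_interval(3, 6, True, False))   # [3, 6)
```

Each result can be confirmed by adding the relevant entries of the p.f. table directly.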

For discrete random variables the probability function and the distribution 
function determine each other: Let x_i, for i = 1, 2, ..., denote the possible values of X. 
Then clearly, for any x, 

$$
F(x) = \sum_{x_i \le x} f(x_i) \tag{4.9}
$$

and 

$$
f(x) = F(x) - \lim_{t \to x^-} F(t). \tag{4.10}
$$

The first of these equations shows that F(x) is constant between successive x_i 
values, and the latter equation shows that f(x_i) equals the value of the jump of F at 
x = x_i. 
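
The two directions, from f to F by cumulative sums and from F to f by jumps, can be sketched as follows. This is a hypothetical Python illustration in which a step d.f. is stored as a list of (x_i, F(x_i)) pairs at its jump points; that representation and the function names are our own assumptions, not the text's.

```python
from fractions import Fraction

# Step d.f. of Example 4.1.2, stored as (x_i, F(x_i)) at its jump points.
steps = [(0, Fraction(1, 8)), (1, Fraction(4, 8)),
         (2, Fraction(7, 8)), (3, Fraction(1))]

def pf_from_df(steps):
    """Equation 4.10: f(x_i) = F(x_i) minus the left limit of F at x_i."""
    pf = {}
    previous = Fraction(0)  # F is 0 to the left of the first jump
    for x, value in steps:
        pf[x] = value - previous
        previous = value
    return pf

def df_from_pf(pf):
    """Equation 4.9: F(x_i) = sum of f(x_j) over all x_j <= x_i."""
    total = Fraction(0)
    result = []
    for x in sorted(pf):
        total += pf[x]
        result.append((x, total))
    return result

pf = pf_from_df(steps)
print(pf)
print(df_from_pf(pf) == steps)
```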


Exercises 


Exercise 4.1.1. Let X be the number of hearts in a randomly dealt poker hand of five 
cards. Draw a histogram for its probability function and a graph for its distribution 
function. 


Exercise 4.1.2. Let X be the number of heads obtained in five independent tosses of a 
fair coin. Draw a histogram for its probability function and a graph for its distribution 
function. 


Exercise 4.1.3. Let X be the number of heads minus the number of tails obtained in 
four independent tosses of a fair coin. Draw a histogram for its probability function 
and a graph for its distribution function. 


Exercise 4.1.4. Let X be the absolute value of the difference between the number of 
heads and the number of tails obtained in four independent tosses of a fair coin. Draw 
a histogram for its probability function and a graph for its distribution function. 


Exercise 4.1.5. Let X be the larger of the number of heads and the number of tails 
obtained in five independent tosses of a fair coin. Draw a histogram for its probability 
function and a graph for its distribution function. 


Exercise 4.1.6. Let X be the number of heads minus the number of tails obtained in 
n independent tosses of a fair coin. Find a formula for its probability function and 
one for its distribution function. 


Exercise 4.1.7. Suppose we perform independent Bernoulli trials with parameter p, 
until we obtain two consecutive successes or two consecutive failures. Draw a tree 
diagram and find the probability function of the number of trials. 

Exercise 4.1.8. Suppose two players, A and B, play a game consisting of independent 
trials, each of which can result in a win for A or for B, or in a draw D, 
until one player wins a trial. In each trial, P(A wins) = p_1, P(B wins) = p_2, and 
P(D) = q = 1 − (p_1 + p_2). Let X = n if A wins the game in the nth trial, and 
X = 0 if A does not win the game ever. Draw a tree diagram and find the probability 
function of X. Also find the probability that A wins (in any number of trials) and 
the probability that B wins. Also show that the probability of an endless sequence of 
draws is 0. 


Exercise 4.1.9. Let X be the number obtained in a single roll of a fair die. Draw a 
histogram for its probability function and a graph for its distribution function. 


Exercise 4.1.10. We roll two fair dice, a blue and a red one, independently of each 
other. Let X be the number obtained on the blue die minus the number obtained 
on the red die. Draw a histogram for its probability function and a graph for its 
distribution function. 


Exercise 4.1.11. We roll two fair dice independently of each other. Let X be the 
absolute value of the difference of the numbers obtained on them. Draw a histogram 
for its probability function and a graph for its distribution function. 


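Exercise 4.1.12. Let the distribution function of a random variable X be given by 

$$
F(x) = \begin{cases}
0 & \text{if } x < -2 \\
1/4 & \text{if } -2 \le x < 2 \\
7/8 & \text{if } 2 \le x < 3 \\
1 & \text{if } x \ge 3.
\end{cases} \tag{4.11}
$$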


Find the probability function of X and graph both F and f. 


Exercise 4.1.13. Let A_1, A_2, ... be a nondecreasing sequence of events on a sample 
space S, that is, let A_n ⊂ A_{n+1} for n = 1, 2, ..., and let A = ∪_{k=1}^∞ A_k. Prove that 
P(A) = lim_{n→∞} P(A_n). Hint: Write A as the disjoint union A_1 ∪ [∪_{k=2}^∞ (A_k − A_{k−1})] 
and apply the axiom of countable additivity. 


Exercise 4.1.14. Let A_1, A_2, ... be a nonincreasing sequence of events on a sample 
space S, that is, let A_n ⊃ A_{n+1} for n = 1, 2, ..., and let A = ∩_{k=1}^∞ A_k. Prove that 
P(A) = lim_{n→∞} P(A_n). Hint: Apply De Morgan’s laws to the result of the preceding 
exercise. 


Exercise 4.1.15. Prove that for the distribution function of any random variable, 
lim_{x→−∞} F(x) = 0. Hint: Use the result of the preceding exercise and the theorem 
from real analysis that if lim_{n→∞} F(x_n) = L for every sequence (x_n) decreasing to 
−∞, then lim_{x→−∞} F(x) = L. 


Exercise 4.1.16. Prove that for the distribution function of any random variable, 
lim_{x→∞} F(x) = 1. Hint: Use the result of Exercise 4.1.13 and the theorem from 
real analysis that if lim_{n→∞} F(x_n) = L for every sequence (x_n) increasing to ∞, 
then lim_{x→∞} F(x) = L. 

Exercise 4.1.17. Prove that the distribution function F of any random variable is 
continuous from the right at every x. Hint: Use a modified version of the hints of the 
preceding exercises. 


4.2 Continuous Random Variables 


In this section, we consider random variables X whose possible values constitute a 
finite or infinite interval and whose distribution function is not a step function, but a 
continuous function. Such random variables are called continuous random variables. 

The continuity of F implies that in Equation 4.10 lim_{t→x−} F(t) = lim_{t→x} F(t) = 
F(x), for every x, and so f(x) = 0, for every x. Thus, the probability function 
does not describe the distribution of such random variables because, in this case, the 
probability of X taking on any single value is zero. The latter statement can also 
be seen directly in the case of choosing a number at random from an interval, say 
from [0, 1]: If the probability of every value x were some positive c, then the total 
probability for obtaining any x ∈ [0, 1] would be ∞ · c = ∞, in contradiction to 
the axiom requiring the total to be 1. On the other hand, we have no problem with 
f(x) = 0 for every x, since ∞ · 0 is indeterminate. 

However, even if the probability of X taking on any single value is zero, the 
probability of X taking on any value in an interval need not be zero. Now, for a 
discrete random variable, the histogram of f(x) readily displayed the probabilities 
of X falling in an interval I as the sum of the areas of the rectangles over I. Hence, a 
very natural generalization of such histograms suggests itself for continuous random 
variables: Just consider a continuous curve instead of the jagged top of the rectangles, 
and let the probability of X falling in I be the area under the curve over I. Thus we 
make the following formal definition: 


Definition 4.2.1 (Probability Density). Let X be a continuous random variable. 
If there exists a nonnegative function f that is integrable over R and for which 
∫_{−∞}^{x} f(t) dt = F(x), for all x, then f is called the probability density function 
(or briefly, the density or p.d.f.) of X, and X is called absolutely continuous. 


Thus, if X has a density function, then 


$$
P(X \in [x, y]) = F(y) - F(x) = \int_x^y f(t)\,dt, \tag{4.12}
$$


and the probability remains the same whether we include or exclude one or both 
endpoints x and y of the interval. 

While the density function is not a probability, it is often used with differential 
notation to write the probability of X falling in an infinitesimal interval as³ 

$$
P(X \in [x, x + dx]) = \int_x^{x+dx} f(t)\,dt \sim f(x)\,dx. \tag{4.13}
$$


³ The symbol ∼ means that the ratio of the expressions on either side of it tends to 1 as dx 
tends to 0, or equivalently, that the limits of each side divided by dx are equal. 




Example 4.2.1 (Uniform Random Variable). Consider a finite interval [a, b], with 
a < b, and pick a point⁴ X at random from it, that is, let the possible values of X 
be the numbers of [a, b], and let X fall in each subinterval [c, d] of [a, b] with a 
probability that is proportional to the length of [c, d] but that does not depend on the 
location of [c, d] within [a, b]. This distribution is achieved by the density function⁵ 

$$
f(x) = \begin{cases}
\dfrac{1}{b-a} & \text{if } a \le x \le b \\
0 & \text{if } x < a \text{ or } x > b.
\end{cases} \tag{4.14}
$$

See Figure 4.6. Then, for a ≤ c ≤ d ≤ b, 

$$
P(X \in [c, d]) = \int_c^d f(t)\,dt = \frac{d-c}{b-a}, \tag{4.15}
$$

which is indeed proportional to the length d − c and does not depend on c and d in 
any other way. 
The corresponding distribution function is given by 

$$
F(x) = \begin{cases}
0 & \text{if } x < a \\
\dfrac{x-a}{b-a} & \text{if } a \le x < b \\
1 & \text{if } x \ge b.
\end{cases} \tag{4.16}
$$


See Figure 4.7. 
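
The uniform density and distribution functions of this example are simple enough to code directly. The following Python sketch is only illustrative (the function names and the choice a = 0, b = 4 are ours); it checks that two subintervals of equal length receive the same probability regardless of their location.

```python
def uniform_pdf(x, a, b):
    """Equation 4.14: density 1/(b - a) on [a, b] and 0 elsewhere."""
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_df(x, a, b):
    """Equation 4.16: distribution function of the uniform distribution on [a, b]."""
    if x < a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def prob(c, d, a, b):
    """Equation 4.15 via the d.f.: P(X in [c, d]) = F(d) - F(c)."""
    return uniform_df(d, a, b) - uniform_df(c, a, b)

# Two subintervals of length 1 inside [0, 4]: both get probability 1/4.
print(prob(0.5, 1.5, 0, 4), prob(2.75, 3.75, 0, 4))
```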


Definition 4.2.2 (Uniform Random Variable). A random variable X with the above 
density is called uniform over [a, b], or uniformly distributed over [a, b], its distri- 
bution the uniform distribution over [a, b], and its density and distribution functions 
the uniform density and distribution functions over [a, b]. 


By the fundamental theorem of calculus, the definition of the density function 
shows that for random variables with density f 


F'(x) = f(x) (4.17) 


wherever f is continuous, and so at such points F is differentiable. There exist, 
however, continuous random variables whose F is everywhere continuous but not 
differentiable, and which therefore do not have a density function. Such random variables 
occur only very rarely in applications, and we do not discuss them in this book. In 
fact, we shall use the term continuous random variable, as most introductory books 
do, to denote random variables that possess a density function, instead of the precise 
term “absolutely continuous.” 


⁴ We frequently use the words “point” and “number” interchangeably, ignoring the 
distinction between a number and its representation on the number line, just as the word “interval” 
is commonly used for both numbers and points. 

⁵ f is not unique: its values can be changed at a countable number of points, as at a and b, 
for instance, without affecting the probabilities, which are integrals of f. 


Fig. 4.6. The Uniform Density Function Over [a, b]. 

Often we know only the general shape of the density function and we need to find 
the value of an unknown constant in its equation. Such constants can be determined 
by the requirement that f satisfy ∫_{−∞}^{∞} f(t) dt = 1, because the integral here equals 
the probability that X takes on any value whatsoever. The next example is of this 
type. 


Example 4.2.2 (Exponential Waiting Time). Assume that the time T in minutes you 
have to wait on a certain summer night to see a shooting star has a probability density 
of the form 


$$
f(t) = \begin{cases}
0 & \text{if } t < 0 \\
C e^{-t/10} & \text{if } t \ge 0.
\end{cases} \tag{4.18}
$$


Find the value of C and the distribution function of T and compute the probability 
that you have to wait more than 10 minutes. 


Now, 

$$
\int_{-\infty}^{\infty} f(t)\,dt = \int_0^{\infty} C e^{-t/10}\,dt = -10\,C e^{-t/10}\,\Big|_0^{\infty} = 10C, \tag{4.19}
$$

and since this integral must equal 1, C = 1/10. Thus 

$$
f(t) = \begin{cases}
0 & \text{if } t < 0 \\
\dfrac{1}{10}\, e^{-t/10} & \text{if } t \ge 0
\end{cases} \tag{4.20}
$$

and, for t ≥ 0, 

$$
F(t) = P(T \le t) = \int_0^t \frac{1}{10}\, e^{-u/10}\,du = 1 - e^{-t/10}. \tag{4.21}
$$

Consequently, 

$$
P(T > 10) = 1 - F(10) = e^{-1} \approx 0.368. \tag{4.22}
$$


Fig. 4.7. The Uniform Distribution Function Over [a, b]. 
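
The normalization step of this example can also be checked numerically. The sketch below is a rough Python illustration, not the text's method: it approximates the integral of e^{−t/10} with a simple midpoint rule, truncating the infinite range at t = 1000 (an assumption that is harmless here because the tail beyond that point is negligible), and then evaluates P(T > 10) from the closed-form d.f.

```python
from math import exp

def integrate(g, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))

# Integral of exp(-t/10) over [0, infinity), truncated at t = 1000.
unnormalized = integrate(lambda t: exp(-t / 10), 0.0, 1000.0)
C = 1 / unnormalized  # the constant that makes the total probability 1

def F(t):
    """Distribution function of T from Equation 4.21."""
    return 1 - exp(-t / 10) if t >= 0 else 0.0

p_wait = 1 - F(10)  # P(T > 10)
print(C, p_wait)
```

The computed `C` should be close to 1/10 and `p_wait` close to e^{−1}.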


The distribution of the example above is typical of many waiting time distribu- 
tions occurring in real life, at least approximately. For instance, the time between the 
decay of atoms in a radioactive sample, the time one has to wait for the phone to 
ring in an office, and the time between customers showing up at some store are of 
this type; the constants just differ. (The reasons for the prevalence of this distribution 
will be discussed later under the heading “Poisson process.”) 


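Definition 4.2.3 (Exponential Random Variable). A random variable T is called 
exponential with parameter λ > 0 if it has density 

$$
f(t) = \begin{cases}
0 & \text{if } t < 0 \\
\lambda e^{-\lambda t} & \text{if } t \ge 0
\end{cases} \tag{4.23}
$$

and distribution function 

$$
F(t) = \begin{cases}
0 & \text{if } t < 0 \\
1 - e^{-\lambda t} & \text{if } t \ge 0.
\end{cases} \tag{4.24}
$$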
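
The distribution function in this definition follows from the density by direct integration, exactly as in Example 4.2.2 (which is the case λ = 1/10). A short worked derivation, with the dummy variable u being our own choice of notation, is:

```latex
% For t >= 0, integrate the exponential density from 0 to t:
\[
F(t) = \int_{-\infty}^{t} f(u)\,du
     = \int_{0}^{t} \lambda e^{-\lambda u}\,du
     = \Bigl[-e^{-\lambda u}\Bigr]_{0}^{t}
     = 1 - e^{-\lambda t}.
\]
% Letting t tend to infinity confirms that the total probability is 1:
\[
\int_{0}^{\infty} \lambda e^{-\lambda u}\,du
     = \lim_{t \to \infty} \bigl(1 - e^{-\lambda t}\bigr) = 1 .
\]
```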


There exist random variables that are neither discrete nor continuous; they are 
said to be of mixed type. Here is an example: 


Example 4.2.3 (A Mixed Random Variable). Suppose we toss a fair coin and if it 
comes up H, then X = 1, and if it comes up T, then X is determined by spinning a 
pointer and noting its final position on a scale from 0 to 2, that is, X is then uniformly 
distributed over the interval [0, 2]. 

The distribution function F is then given by 

$$
F(x) = \begin{cases}
0 & \text{if } x < 0 \\
\dfrac{1}{4}x & \text{if } 0 \le x < 1 \\
\dfrac{1}{4}x + \dfrac{1}{2} & \text{if } 1 \le x < 2 \\
1 & \text{if } 2 \le x
\end{cases} \tag{4.25}
$$

and its graph is given by Figure 4.8. 


Fig. 4.8. A Mixed Type Distribution Function. 


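Note that 

$$
F'(x) = f(x) = \begin{cases}
0 & \text{if } x < 0 \\
\dfrac{1}{4} & \text{if } 0 < x < 1 \\
\dfrac{1}{4} & \text{if } 1 < x < 2 \\
0 & \text{if } 2 < x
\end{cases} \tag{4.26}
$$

exists everywhere except at x = 0, 1 and 2, but because of the jump of F at 1, it is 
not a true density function. Indeed, 

$$
F(x) = \begin{cases}
\displaystyle\int_{-\infty}^{x} f(t)\,dt & \text{if } x < 1 \\
\displaystyle\int_{-\infty}^{x} f(t)\,dt + \frac{1}{2} & \text{if } 1 \le x,
\end{cases} \tag{4.27}
$$

and so F(x) ≠ ∫_{−∞}^{x} f(t) dt for x ≥ 1, whereas the definition of density 
functions requires equality for all x. 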
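
The jump of F at 1 can be made visible numerically. The sketch below is a hypothetical Python illustration: `F` encodes the distribution function of this example (X = 1 with probability 1/2, otherwise uniform on [0, 2]), and `jump` approximates F(x) − lim_{t→x−} F(t) by evaluating F slightly to the left of x, with the step `eps` being an arbitrary small number of our choosing.

```python
def F(x):
    """d.f. of the mixed random variable: X = 1 w.p. 1/2, else uniform on [0, 2]."""
    if x < 0:
        return 0.0
    if x < 1:
        return x / 4
    if x < 2:
        return x / 4 + 0.5
    return 1.0

def jump(F, x, eps=1e-9):
    """Approximate the jump F(x) - F(x-) by stepping slightly to the left of x."""
    return F(x) - F(x - eps)

# A jump of about 1/2 at x = 1 (so P(X = 1) = 1/2) and essentially none elsewhere.
for x in (0.5, 1.0, 1.5, 2.0):
    print(x, round(jump(F, x), 6))
```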


Exercises 
Exercise 4.2.1. A continuous random variable X has a density of the form 

$$
f(x) = \begin{cases}
Cx & \text{if } 0 \le x \le 4 \\
0 & \text{otherwise.}
\end{cases} \tag{4.28}
$$

1. Find C. 
2. Sketch the density function of X. 
3. Find the distribution function of X and sketch its graph. 
4. Find the probability P(X < 1). 
5. Find the probability P(2 < X). 


Exercise 4.2.2. A continuous random variable X has a density of the form f(x) = 
Ce^{−|x|}, defined on all of R. 


1. Find C. 
2. Sketch the density function of X. 
3. Find the distribution function of X and sketch its graph. 
4. Find the probability P(−2 < X < 1). 
5. Find the probability P(2 < |X|). 


Exercise 4.2.3. A continuous random variable X has a density of the form 

$$
f(x) = \begin{cases}
\dfrac{C}{x^2} & \text{if } x \ge 1 \\
0 & \text{if } x < 1.
\end{cases}
$$

1. Find C. 
2. Sketch the density function of X. 
3. Find the distribution function of X and sketch its graph. 
4. Find the probability P(X < 2). 
5. Find the probability P(2 < |X|). 


Exercise 4.2.4. A continuous random variable X has a density of the form 

$$
f(x) = \begin{cases}
\dfrac{C}{x^2} & \text{if } |x| \ge 1 \\
0 & \text{if } |x| < 1.
\end{cases}
$$

1. Find C. 
2. Sketch the density function of X. 
3. Find the distribution function of X and sketch its graph. 
4. Find the probability P(X < 2). 
5. Find the probability P(2 < |X|). 


Exercise 4.2.5. Let X be a mixed random variable with distribution function 


0 ifx <0 

1 

Se if0<x <1 
F(x)= I 

= ifl<x <2 

3 

1 if2<x. 


1. Devise an experiment whose outcome is this X. 
2. Find the probability P(X < 1/2). 
3. Find the probability P(X < 3/2). 
4. Find the probability P(1/2 < X < 2). 
5. Find the probability P(X = 1). 
6. Find the probability P(X > 1). 
7. Find the probability P(X = 2). 


Exercise 4.2.6. Let X be a mixed random variable with distribution function 
ifx <0 


x+ if0<x <1 


ale 


F(xy)= 
ifl<x <2 


FPWIlLNwi|l eo 


if2<x. 


1. Devise an experiment whose outcome is this X. 
2. Find the probability P(X < 1/2). 
3. Find the probability P(X < 3/2). 
4. Find the probability P(1/2 < X < 2). 
5. Find the probability P(X = 1). 
6. Find the probability P(X > 1). 
7. Find the probability P(X = 3/2). 


Exercise 4.2.7. Let X be a mixed random variable with distribution function F given 
by the graph in Figure 4.9. 


1. Find a formula for F(x). 
2. Find the probability P(X < 1/2). 
3. Find the probability P(X < 3/2). 
4. Find the probability P(1/2 < X < 2). 
5. Find the probability P(X = 1). 
6. Find the probability P(X > 1). 
7. Find the probability P(X = 2). 


Exercise 4.2.8. Let X be a mixed random variable with distribution function F given 
by the graph in Figure 4.10. 


Fig. 4.10. 


1. Find a formula for F(x). 
2. Find the probability P(X < 1/2). 
3. Find the probability P(X < 3/2). 
4. Find the probability P(1/2 < X < 2). 
5. Find the probability P(X = 1). 
6. Find the probability P(X > 1). 
7. Find the probability P(X = 2). 


4.3 Functions of Random Variables 


In many applications we need to find the distribution of a function of a random vari- 
able. For instance, we may know from measurements the distribution of the radius 
of stars, and we may want to know the distribution of their volumes. (Probabilities 
come in, as in several examples of Chapter 1, from a random choice of a single 
star.) Or, we may know the income distributions in different countries, and want to 
change scales to be able to compare them. We shall encounter many more examples 
in the rest of the book. We start off with the change of scale example in a general 
setting. 

Example 4.3.1 (Linear Functions of Random Variables). Let X be a random variable 
with a known distribution function F_X and define a new random variable as Y = 
aX + b, where a ≠ 0 and b are given constants. 

If X is discrete, then we can obtain the probability function f_Y of Y very easily 
by solving the equation in its definition for X: 

$$
f_Y(y) = P(Y = y) = P(aX + b = y) = P\!\left(X = \frac{y-b}{a}\right) = f_X\!\left(\frac{y-b}{a}\right). \tag{4.29}
$$

Equivalently, if x is a possible value of X, that is, f_X(x) ≠ 0, then f_Y(y) = f_X(x) 
for y = ax + b, which is the corresponding possible value of Y. 

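If X is continuous, then we cannot imitate the above procedure, because the 
density function is not a probability. We can, however, obtain the distribution function 
F_Y of Y similarly, by solving the inequality in its definition for X: For a > 0, 

$$
F_Y(y) = P(Y \le y) = P(aX + b \le y) = P\!\left(X \le \frac{y-b}{a}\right) = F_X\!\left(\frac{y-b}{a}\right), \tag{4.30}
$$

and for a < 0, 

$$
F_Y(y) = P(Y \le y) = P(aX + b \le y) = P\!\left(X \ge \frac{y-b}{a}\right) = 1 - F_X\!\left(\frac{y-b}{a}\right). \tag{4.31}
$$

If X is continuous with density f_X, then F_X is differentiable and f_X = F_X'. 
As Equations 4.30 and 4.31 show, then F_Y is also differentiable. Hence Y too is 
continuous, with density function 

$$
f_Y(y) = F_Y'(y) = \pm \frac{d}{dy} F_X\!\left(\frac{y-b}{a}\right) = \pm \frac{1}{a}\, f_X\!\left(\frac{y-b}{a}\right) = \frac{1}{|a|}\, f_X\!\left(\frac{y-b}{a}\right), \tag{4.32}
$$

where the + sign applies for a > 0 and the − sign for a < 0. 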


Example 4.3.2 (Shifting and Stretching a Discrete Uniform Variable). Let X denote 
the number obtained in the roll of a die and let Y = 2X + 10. Then the p.f. of X is 

$$
f_X(x) = \begin{cases}
1/6 & \text{if } x = 1, 2, \ldots, 6 \\
0 & \text{otherwise.}
\end{cases} \tag{4.33}
$$

Thus, using Equation 4.29 with this f_X and with a = 2 and b = 10, we get the p.f. 
of Y as 

$$
f_Y(y) = f_X\!\left(\frac{y-10}{2}\right) = \begin{cases}
1/6 & \text{if } y = 12, 14, \ldots, 22 \\
0 & \text{otherwise.}
\end{cases} \tag{4.34}
$$

We can obtain the same result more simply, by tabulating the possible x and 
y = 2x + 10 values and the corresponding probabilities: 


x               |  1  |  2  |  3  |  4  |  5  |  6 
y               | 12  | 14  | 16  | 18  | 20  | 22 
f_X(x) = f_Y(y) | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 
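
The tabulation above amounts to relabeling each possible value x as y = 2x + 10 while keeping its probability. A minimal Python sketch of that relabeling follows; the helper name `transform_pf` is our own, and the approach relies on the map being one-to-one, so that no two x values land on the same y.

```python
from fractions import Fraction

# p.f. of a fair die (Equation 4.33), listing only the possible values.
f_X = {x: Fraction(1, 6) for x in range(1, 7)}

def transform_pf(f_X, g):
    """p.f. of Y = g(X) for a one-to-one g: each value x moves to g(x)."""
    return {g(x): p for x, p in f_X.items()}

f_Y = transform_pf(f_X, lambda x: 2 * x + 10)
for y, p in f_Y.items():
    print(f"f_Y({y}) = {p}")
```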


Example 4.3.3 (Shifting and Stretching a Uniform Variable). Let X be uniform on 
the interval [−1, 1] and let Y = 2X + 10. Then the p.d.f. of X is 

$$
f_X(x) = \begin{cases}
1/2 & \text{if } x \in [-1, 1] \\
0 & \text{otherwise.}
\end{cases} \tag{4.35}
$$

If X = −1, then Y = 2(−1) + 10 = 8, and if X = 1, then Y = 2 · 1 + 10 = 12. 
Thus, the interval [−1, 1] gets changed into [8, 12], and so Equation 4.32, with the 
present f_X and with a = 2 and b = 10, yields 

$$
f_Y(y) = \frac{1}{2}\, f_X\!\left(\frac{y-10}{2}\right) = \begin{cases}
1/4 & \text{if } y \in [8, 12] \\
0 & \text{otherwise.}
\end{cases} \tag{4.36}
$$

Notice that here the support of the p.d.f. got shifted and stretched in much the 
same way as the support of the p.f. in the preceding example, but there the values of 
the p.f. remained 1/6, while here the values of the p.d.f. became halved. The reason 
for this difference is clear: In the discrete case, the number of possible values has not 
changed (both X and Y had six), but in the continuous case the interval of support 
got stretched by a factor of 2 (from length 2 to length 4) and so, to compensate for 
that, in order to have a total area of 1, we had to halve the density. 

We are going to generalize the previous examples in two steps: First, we consider 
the case in which Y is an invertible function of X, and then the case in which it is 
not. 


Theorem 4.3.1 (Distribution of a One-to-one Function of a Random Variable).
Let X be a random variable with a known distribution function $F_X$ and define a new
random variable as Y = g(X), where g is a one-to-one function on R, and write $g^{-1}$
for the inverse of g.

If X is discrete, then the probability function $f_Y$ of Y is obtained from the
probability function $f_X$ of X by⁶

$$
f_Y(y) = \begin{cases} f_X(g^{-1}(y)) & \text{if } y = g(x) \text{ for some } x \in \mathrm{Range}(X) \\ 0 & \text{otherwise.} \end{cases} \tag{4.37}
$$

If X is of any type and g is strictly increasing, then the distribution function $F_Y$
of Y is obtained from the distribution function $F_X$ of X by

$$
F_Y(y) = \begin{cases} 0 & \text{if } y < g(x) \text{ for all } x \in \mathrm{Range}(X) \\ F_X(g^{-1}(y)) & \text{if } y = g(x) \text{ for some } x \in \mathrm{Range}(X) \\ 1 & \text{if } y > g(x) \text{ for all } x \in \mathrm{Range}(X). \end{cases} \tag{4.38}
$$

If X is of any type and g is strictly decreasing, then the distribution function $F_Y$
of Y is obtained from the distribution function $F_X$ of X by

$$
F_Y(y) = \begin{cases} 0 & \text{if } y < g(x) \text{ for all } x \in \mathrm{Range}(X) \\ 1 - \lim_{t \to y^+} F_X(g^{-1}(t)) & \text{if } y = g(x) \text{ for some } x \in \mathrm{Range}(X) \\ 1 & \text{if } y > g(x) \text{ for all } x \in \mathrm{Range}(X). \end{cases} \tag{4.39}
$$

If X is continuous and has density function $f_X$, and g is differentiable, in addition
to being one-to-one, then Y has a density function $f_Y$ given by

$$
f_Y(y) = \begin{cases} f_X(g^{-1}(y)) \left|\dfrac{d}{dy} g^{-1}(y)\right| = \dfrac{f_X(x)}{|g'(x)|} & \text{if } y = g(x) \text{ for some } x \in \mathrm{Range}(X) \\ 0 & \text{otherwise.} \end{cases} \tag{4.40}
$$

6 Remember that X is a function on the sample space, and so Range(X) = set of all possible
values of X.


The proof of this theorem is very much like Example 4.3.1, and is left as an 
exercise. 


Example 4.3.4 (Squaring a Binomial). Let X be binomial with parameters n = 3 and
p = 1/2 and let $Y = X^2$. Then we can obtain $f_Y$ by tabulating the possible X and
$Y = X^2$ values and the corresponding probabilities:

x               |  0  |  1  |  2  |  3
y               |  0  |  1  |  4  |  9
f_X(x) = f_Y(y) | 1/8 | 3/8 | 3/8 | 1/8


Example 4.3.5 (Squaring a Positive Uniform Random Variable). Let X be uniform
on the interval [1, 3] and let $Y = X^2$. Then the p.d.f. of X is

$$
f_X(x) = \begin{cases} 1/2 & \text{if } x \in [1, 3] \\ 0 & \text{otherwise.} \end{cases} \tag{4.41}
$$

Now, $g(X) = X^2$ is one-to-one for the possible values of X, which are positive, and
so

$$
F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(X \le \sqrt{y}) = F_X(\sqrt{y}) \tag{4.42}
$$

and, by the chain rule,

$$
f_Y(y) = \frac{d}{dy} F_X(\sqrt{y}) = f_X(\sqrt{y}) \frac{d\sqrt{y}}{dy} = \begin{cases} \dfrac{1}{2} \cdot \dfrac{1}{2\sqrt{y}} & \text{if } y \in [1, 9] \\ 0 & \text{otherwise.} \end{cases} \tag{4.43}
$$


This result could also be obtained by substituting the present $f_X$ into Equation
4.40.
We can check that this $f_Y$ is indeed a density function:

$$
\int_{-\infty}^{\infty} f_Y(y)\,dy = \int_1^9 \frac{dy}{4\sqrt{y}} = \frac{1}{2}\left(\sqrt{9} - \sqrt{1}\right) = 1. \tag{4.44}
$$

Example 4.3.6 (Random Number Generation). An important application of Theorem
4.3.1 is to the computer simulation of physical systems with random inputs. Most
mathematical and statistical software packages produce so-called random numbers
(or more precisely: pseudo-random numbers) that are uniformly distributed on the
interval [0, 1]. (Though such numbers are generated by deterministic algorithms,
they are for most practical purposes a good substitute for samples of independent,
uniform random variables on the interval [0, 1].) Often, however, we need random
numbers with a different distribution, and want to transform the uniform random
numbers to new numbers that have the desired distribution.

Suppose we need random numbers that have the continuous distribution function
F such that F is strictly increasing where it is not 0 or 1. (The restrictions on F can
be removed, but we do not want to get into this.) Then F has a strictly increasing
inverse $F^{-1}$ over [0, 1], which we can use as the function g in Theorem 4.3.1. Thus,
letting $Y = F^{-1}(X)$, with X being uniform on [0, 1], we have

$$
F_Y(y) = P(Y \le y) = P(F^{-1}(X) \le y) = P(X \le F(y)) = F(y), \tag{4.45}
$$

where the last step follows from the fact that P(X ≤ x) = x on [0, 1] for an X that
is uniform on [0, 1]. (See Equation 4.16.)


Thus, if $x_1, x_2, \ldots$ are random numbers uniform on [0, 1] produced by the
generator, then the numbers $y_1 = F^{-1}(x_1), y_2 = F^{-1}(x_2), \ldots$ are random numbers
with the distribution function F.
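The recipe of Example 4.3.6 can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: the target distribution is taken to be exponential, F(y) = 1 - exp(-rate·y) for y ≥ 0, with the arbitrary value rate = 2, and the function names are hypothetical.

```python
import math
import random

def exponential_inverse_cdf(u, rate):
    """Inverse of F(y) = 1 - exp(-rate * y), y >= 0 (an illustrative choice of F)."""
    return -math.log(1.0 - u) / rate

def sample_by_inversion(inverse_cdf, n, seed=0):
    """Turn n uniform random numbers x_i in [0, 1) into y_i = F^{-1}(x_i)."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    return [inverse_cdf(rng.random()) for _ in range(n)]

rate = 2.0  # assumed parameter, chosen only for the illustration
samples = sample_by_inversion(lambda u: exponential_inverse_cdf(u, rate), 20000)

# The sample mean should be close to 1 / rate, and the fraction of samples
# not exceeding 0.5 should be close to F(0.5) = 1 - exp(-1).
sample_mean = sum(samples) / len(samples)
empirical_cdf_at_half = sum(s <= 0.5 for s in samples) / len(samples)
```

Any other F that is continuous and strictly increasing where it is not 0 or 1 could be used in the same way, provided its inverse can be evaluated.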


If g is not one-to-one, we can still follow the procedures of Example 4.3.1 but, 
for some y, we have more than one solution of the equation y = g(x) or of the corre- 
sponding inequality, and we must consider all of those solutions, as in the following 
example. 


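Example 4.3.7 (The $X^2$ Function). Let X be a random variable with a known
distribution function $F_X$ and define a new random variable as $Y = X^2$.
If X is discrete, then we can obtain the probability function $f_Y$ of Y as

$$
f_Y(y) = P(X^2 = y) = \begin{cases} P(X = \pm\sqrt{y}) = f_X(\sqrt{y}) + f_X(-\sqrt{y}) & \text{if } y > 0 \\ P(X = 0) = f_X(0) & \text{if } y = 0 \\ 0 & \text{if } y < 0. \end{cases} \tag{4.46}
$$

For continuous X, the distribution function $F_Y$ of Y is given by

$$
F_Y(y) = P(X^2 \le y) = \begin{cases} P(-\sqrt{y} \le X \le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}) & \text{if } y > 0 \\ 0 & \text{if } y \le 0, \end{cases} \tag{4.47}
$$

and for discrete X, we have

$$
F_Y(y) = \begin{cases} P(-\sqrt{y} \le X \le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}) + f_X(-\sqrt{y}) & \text{if } y \ge 0 \\ 0 & \text{if } y < 0. \end{cases} \tag{4.48}
$$

If X is continuous and has density function $f_X$, then differentiating Equation
4.47 we get

$$
f_Y(y) = F_Y'(y) = \begin{cases} \dfrac{1}{2\sqrt{y}} \left[ f_X(\sqrt{y}) + f_X(-\sqrt{y}) \right] & \text{if } y > 0 \\ 0 & \text{if } y \le 0. \end{cases} \tag{4.49}
$$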


Example 4.3.8 (Distribution of $(X - 2)^2$ for a Binomial). Let X be binomial with
parameters n = 3 and p = 1/2, and let $Y = (X - 2)^2$. Rather than developing a
formula like Equation 4.46, the best way to proceed is to tabulate the possible values
of X and Y and the corresponding probabilities, as in Example 4.3.4:

x      |  0  |  1  |  2  |  3
y      |  4  |  1  |  0  |  1
f_X(x) | 1/8 | 3/8 | 3/8 | 1/8

Now, Y = 1 occurs when X = 1 or 3. Since these cases are mutually exclusive,
P(Y = 1) = P(X = 1) + P(X = 3) = 3/8 + 1/8 = 1/2. Hence, the table of $f_Y$ is

y      |  0  |  1  |  4
f_Y(y) | 3/8 | 1/2 | 1/8
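The tabulation used in Examples 4.3.4 and 4.3.8 amounts to adding up $f_X(x)$ over all x that are mapped to the same y. A minimal Python sketch of that bookkeeping follows; the helper name `pushforward_pmf` is hypothetical, and exact fractions are used so that the results can be compared with the tables directly.

```python
from fractions import Fraction
from math import comb

def pushforward_pmf(pmf_x, g):
    """Probability function of Y = g(X): add f_X(x) over all x with g(x) = y."""
    pmf_y = {}
    for x, p in pmf_x.items():
        y = g(x)
        pmf_y[y] = pmf_y.get(y, Fraction(0)) + p
    return pmf_y

# Binomial p.f. with n = 3 and p = 1/2, as in Examples 4.3.4 and 4.3.8.
binomial_pmf = {k: Fraction(comb(3, k), 2**3) for k in range(4)}

pmf_square = pushforward_pmf(binomial_pmf, lambda x: x**2)           # one-to-one on 0..3
pmf_shifted = pushforward_pmf(binomial_pmf, lambda x: (x - 2) ** 2)  # not one-to-one
```

For the one-to-one map the probabilities are merely relabeled, while for $(x - 2)^2$ the values x = 1 and x = 3 are merged into y = 1.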


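Example 4.3.9 (Distribution of $X^2$ for a Uniform X). Let X be uniform on the
interval [-1, 1] and let $Y = X^2$. Then, by Formula 4.16,

$$
F_X(x) = \begin{cases} 0 & \text{if } x < -1 \\ \dfrac{x+1}{2} & \text{if } -1 \le x < 1 \\ 1 & \text{if } x \ge 1. \end{cases} \tag{4.50}
$$

Substituting this $F_X$ into Equation 4.47, and observing that

$$
\frac{\sqrt{y}+1}{2} - \frac{-\sqrt{y}+1}{2} = \sqrt{y}, \tag{4.51}
$$

we get

$$
F_Y(y) = \begin{cases} 0 & \text{if } y < 0 \\ \sqrt{y} & \text{if } 0 \le y < 1 \\ 1 & \text{if } y \ge 1. \end{cases} \tag{4.52}
$$

We can obtain the density of Y by differentiating $F_Y$, as

$$
f_Y(y) = \begin{cases} \dfrac{1}{2\sqrt{y}} & \text{if } 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases} \tag{4.53}
$$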
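As a quick numerical cross-check of Example 4.3.9, the following sketch evaluates Equation 4.47 with the uniform distribution function of Equation 4.50. The function names are illustrative assumptions, not notation from the text.

```python
import math

def cdf_uniform(x, a=-1.0, b=1.0):
    """Distribution function of a uniform random variable on [a, b]."""
    if x < a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def cdf_of_square(cdf_x, y):
    """F_Y(y) for Y = X^2 with continuous X, following Equation 4.47."""
    if y <= 0:
        return 0.0
    root = math.sqrt(y)
    return cdf_x(root) - cdf_x(-root)

# Evaluate F_Y at a few points; on [0, 1) the values should equal sqrt(y).
points = (0.0, 0.25, 0.81, 1.0, 4.0)
values = [cdf_of_square(cdf_uniform, y) for y in points]
```

The same `cdf_of_square` works for any continuous X whose distribution function is supplied as `cdf_x`.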


The methods of the above examples can be generalized to other functions as well, 
and lead to the following theorem: 


Theorem 4.3.2 (Distribution of an Arbitrary Function of a Random Variable).
Let X be a random variable with a known distribution function $F_X$ and define a new
random variable as Y = g(X), where g is any function on R. Let $g^{-1}(A)$ denote
the inverse image of any set A of real numbers under the mapping g, that is, let
$g^{-1}(A) = \{x : g(x) \in A\}$.

If X is discrete, then the probability function $f_Y$ of Y is obtained from the
probability function $f_X$ of X by

$$
f_Y(y) = \begin{cases} \sum_{x \in g^{-1}(\{y\})} f_X(x) & \text{if } y = g(x) \text{ for some } x \in \mathrm{Range}(X) \\ 0 & \text{otherwise.} \end{cases} \tag{4.54}
$$

If X is continuous and has density function f and I is any interval in R, then,
for g such that $\{X \in g^{-1}(I)\}$ is an event,⁷

$$
P(Y \in I) = \int_{\{x : g(x) \in I\}} f(x)\,dx = \int_{\{x : x \in g^{-1}(I)\}} f(x)\,dx. \tag{4.55}
$$

If X is of any type, then the distribution function $F_Y$ of Y is obtained from the
distribution of X by

$$
F_Y(y) = P(X \in g^{-1}((-\infty, y])) \tag{4.56}
$$

provided g is such that $g^{-1}((-\infty, y])$ is an event.

If X is continuous and has density function $f_X$ and g is differentiable and, for all
$y \in \mathbb{R}$, $g^{-1}(\{y\})$ is finite or countably infinite with $g'(x) \ne 0$ on $g^{-1}(\{y\})$, then Y
has a density function $f_Y$ given by

$$
f_Y(y) = \begin{cases} \sum_{x \in g^{-1}(\{y\})} \dfrac{f_X(x)}{|g'(x)|} & \text{if } y = g(x) \text{ for some } x \in \mathrm{Range}(X) \\ 0 & \text{otherwise.} \end{cases} \tag{4.57}
$$

7 Functions that satisfy this condition are called measurable and are discussed in more
advanced books. Most functions encountered in practice, such as continuous or monotone
functions, are measurable.


We omit the proof. Also often, in particular cases, instead of substituting into the 
formulas of Theorem 4.3.2, it is easier to develop them from scratch as we did, for 
example, for Equation 4.47. 


Example 4.3.10 (Coordinates of a Uniform Random Variable on a Circle). Suppose 
that a point is moving around a circle of radius r centered at the origin of the xy 
coordinate system with constant speed, and we observe it at a random instant. What 
is the distribution of each of the point’s coordinates at that time? 

Since the point is observed at a random instant, its position is uniformly
distributed on the circle. Thus its polar angle Θ is a uniform random variable on the
interval [0, 2π], with constant density $f_\Theta(\theta) = 1/(2\pi)$ there and 0 elsewhere. We
want to find the distributions of $X = r\cos\Theta$ and $Y = r\sin\Theta$.

Now, for a given $x = r\cos\theta$, there are two solutions modulo 2π: $\theta_1 =
\arccos(x/r)$ and $\theta_2 = 2\pi - \arccos(x/r)$. So if X ≤ x, then Θ falls in the angle
on the left between these two values. Thus

$$
F_X(x) = P(X \le x) = \begin{cases} 0 & \text{if } x < -r \\ \dfrac{\theta_2 - \theta_1}{2\pi} = 1 - \dfrac{1}{\pi} \arccos\dfrac{x}{r} & \text{if } -r \le x < r \\ 1 & \text{if } r \le x. \end{cases} \tag{4.58}
$$

Hence

$$
f_X(x) = F_X'(x) = \begin{cases} \dfrac{1}{\pi\sqrt{r^2 - x^2}} & \text{if } -r < x < r \\ 0 & \text{otherwise.} \end{cases} \tag{4.59}
$$

Alternatively, we can obtain the density of X by direct substitution into Equation
4.57: The y there is our x now and the x there is θ, while y = g(x) becomes
$x = g(\theta) = r\cos\theta$. Then $|g'(\theta)| = |-r\sin\theta| = r\sqrt{1 - \cos^2\theta} =
r\sqrt{1 - (x/r)^2} = \sqrt{r^2 - x^2}$. Furthermore, for |x| < r, $g^{-1}(\{x\}) = \{\theta_1, \theta_2\}$ and
so $f_Y(y) = \sum_{x \in g^{-1}(\{y\})} f_X(x)/|g'(x)|$ becomes $f_X(x) = \sum_{\theta \in \{\theta_1, \theta_2\}} [1/(2\pi)] \cdot
1/\sqrt{r^2 - x^2} = 1/(\pi\sqrt{r^2 - x^2})$, since there are two equal terms in the sum. This
result is, of course, the same as before. The distribution function can now be
obtained from $f_X$ by integration.

The density of X can also be obtained directly from Figure 4.11 by using Equation
4.13. For x > 0 and dx > 0, the variable X falls into the interval [x, x + dx] if
and only if Θ falls into either of the intervals of size dθ at $\theta_1$ and $\theta_2$. (For negative
x or dx, we need obvious modifications.) Thus, $f_X(x)\,dx = 2 \cdot [1/(2\pi)]\,d\theta$, and so
$f_X(x) = (1/\pi) \cdot [(d\theta)/(dx)] = (1/\pi) \cdot 1/[(dx)/(d\theta)] = 1/(\pi\sqrt{r^2 - x^2})$ as before.

Fig. 4.11. Density of the x-coordinate of a random point on a circle.

We leave the analogous computation for the distribution of the y-coordinate as 
an exercise. 
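The relation between Equations 4.58 and 4.59 can be checked numerically by differentiating the distribution function with a central difference. The sketch below assumes an arbitrary radius r = 2 and a handful of test points; the function names are illustrative only.

```python
import math

def cdf_x(x, r):
    """Distribution function of X = r cos(Theta), Equation 4.58."""
    if x < -r:
        return 0.0
    if x >= r:
        return 1.0
    return 1.0 - math.acos(x / r) / math.pi

def pdf_x(x, r):
    """Density of X, Equation 4.59."""
    if -r < x < r:
        return 1.0 / (math.pi * math.sqrt(r * r - x * x))
    return 0.0

def central_difference(f, x, h=1e-6):
    """Numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

r = 2.0  # assumed radius, chosen only for the illustration
# Triples (x, numerical derivative of F_X at x, f_X(x)); the last two should agree.
checks = [(x, central_difference(lambda t: cdf_x(t, r), x), pdf_x(x, r))
          for x in (-1.5, -0.5, 0.0, 1.0, 1.9)]
```

The density grows without bound as x approaches ±r, so the test points are kept away from the endpoints.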


Exercises 


Exercise 4.3.1. Let X be a discrete uniform random variable with possible values 
-5, -4, ..., 4, 5. Find the probability function and the distribution function of
$Y = X^2 - 3X$.


Exercise 4.3.2. Let X be a binomial random variable with parameters p = 1/2 and 
n = 6. Find the probability function and the distribution function of $Y = X^2 - 2X$.


Exercise 4.3.3. Let X be a Bernoulli random variable with p = 1/2, and Y = 
arctan X . Find the probability function and the distribution function of Y. 


Exercise 4.3.4. Let X be a discrete random variable with probability function fx. 
Find formulas for the probability function and the distribution function of
$Y = (X - a)^2$, where a is an arbitrary constant.


Exercise 4.3.5. Let X be a random variable uniformly distributed on the interval 
[0, 1], and Y = ln X. Find the density function and the distribution function of Y.


Exercise 4.3.6. Let X be a random variable uniformly distributed on the interval 
[—1, 1], and Y = |X|. Find the density function and the distribution function of Y. 


Exercise 4.3.7. Let X be a continuous random variable with density function $f_X$.
Find formulas for the density function and the distribution function of Y = |X|.

Exercise 4.3.8. Assume that the distribution of the radius R of stars has a density
function $f_R$. Find formulas for the density and the distribution function of their
volume $V = (4/3)\pi R^3$.


Exercise 4.3.9. Find the density and the distribution function of Y in Example 
4.3.10. 


Exercise 4.3.10. Let X be a continuous random variable with density $f_X$. Find
formulas for the density and the distribution function of $Y = (X - a)^2$, where a is an
arbitrary constant.


Exercise 4.3.11. Let X be a continuous random variable with a continuous distribu- 
tion function F that is strictly increasing where it is not 0 or 1. Show that the random 
variable Y = F(X) is uniformly distributed on the interval [0, 1]. 


Exercise 4.3.12. Let X be a random variable uniformly distributed on the interval 
[-2, 2], and $Y = (X - 1)^2$.


(a) Find the density function and the distribution function of X. 
(b) Find the distribution function and the density function of Y. 


4.4 Joint Distributions 


In many applications, we need to consider two or more random variables
simultaneously. For instance, the two-way classification of voters in Example 3.3.3 can be
regarded as involving two random variables, if we assign numbers to the various age
groups and party affiliations.

In general, we want to consider joint probabilities of events defined by two or 
more random variables on the same sample space. The probabilities of all such events 
constitute the joint distribution or the bivariate (for two variables) or multivariate 
(for more than two variables) distribution of the given random variables and can be 
described by their joint p.f., d.f., or p.d.f., much as for single random variables.


Definition 4.4.1 (Joint Probability Function). Let X and Y be two discrete random
variables on the same sample space. The function of two variables defined by
f(x, y) = P(X = x, Y = y)⁸ for all possible values⁹ x of X and y of Y, is called
the joint or bivariate probability function of X and Y or of the pair (X, Y).

Similarly, for a set of n random variables on the same sample space, with n a
positive integer greater than 2, we define the joint or multivariate probability function
of $(X_1, X_2, \ldots, X_n)$ as the function given by

$$
f(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n),
$$

for all possible values $x_i$ of each $X_i$, or for all $(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$.

8 P(X = x, Y = y) stands for P(X = x and Y = y) = P({X = x} ∩ {Y = y}).
9 Sometimes f(x, y) is defined for all real numbers x, y, with f(x, y) = 0 if P(X = x) = 0
or P(Y = y) = 0.

If for two random variables we sum f(x, y) over all possible values y of Y, then
we get the (marginal)¹⁰ probability function $f_X$ (or $f_1$) of X. Indeed,

$$
\sum_y f(x, y) = \sum_y P(X = x, Y = y) = P\left(\{X = x\} \cap \Bigl(\bigcup_y \{Y = y\}\Bigr)\right)
= P(\{X = x\} \cap S) = P(X = x) = f_X(x). \tag{4.60}
$$

Similarly, if we sum f(x, y) over all possible values x of X, then we get the
probability function $f_Y$ (or $f_2$) of Y, and if we sum f(x, y) over all possible values
x of X and y of Y both, in either order, then, of course, we get 1.

For n random variables, if we sum $f(x_1, x_2, \ldots, x_n)$ over all possible values $x_i$
of any $X_i$, then we get the joint (marginal) probability function of the n - 1 random
variables $X_j$ with $j \ne i$, and if we sum over all possible values of any k of them,
then we get the joint (marginal) probability function of the remaining n - k random
variables.


Definition 4.4.2 (Joint Distribution Function). Let X and Y be two arbitrary random
variables on the same sample space. The function of two variables defined by
F(x, y) = P(X ≤ x, Y ≤ y), for all real x and y, is called the joint or bivariate
distribution function of X and Y or of the pair (X, Y).

The functions¹¹ $F_X(x) = F(x, \infty)$ and $F_Y(y) = F(\infty, y)$ are called the
(marginal) distribution functions of X and Y.

Similarly, for a set of n random variables on the same sample space, with n a
positive integer greater than 2, we define the joint or multivariate distribution function
of $(X_1, X_2, \ldots, X_n)$ as the function given by $F(x_1, x_2, \ldots, x_n) = P(X_1 \le x_1, X_2 \le
x_2, \ldots, X_n \le x_n)$, for all real numbers $x_1, x_2, \ldots, x_n$.

If we substitute ∞ for any of the arguments of $F(x_1, x_2, \ldots, x_n)$, we get the
marginal d.f.'s of the random variables that correspond to the remaining arguments.


For joint distributions, we have the following obvious theorem: 


Theorem 4.4.1 (Joint Distribution of Two Functions of Two Discrete Random
Variables). If X and Y are two discrete random variables with joint probability
function $f_{X,Y}(x, y)$ and U = g(X, Y) and V = h(X, Y) are any two functions, then the
joint probability function of U and V is given by

$$
f_{U,V}(u, v) = \sum_{(x, y)\,:\,g(x, y) = u,\ h(x, y) = v} f_{X,Y}(x, y). \tag{4.61}
$$


Example 4.4.1 (Sum and Absolute Difference of Two Dice). Roll two fair dice as in 
Example 1.3.3, and let X and Y denote the numbers obtained with them. Find the 
joint probability function of U = X + Y and V = |X — Y|. 

First, we construct a table of the values of U and V, for all possible outcomes x 
and y (see Table 4.1): 


10 The adjective “marginal” is really unnecessary; we just use it occasionally to emphasize
the relation to the joint distribution.
11 F(x, ∞) is shorthand for $\lim_{y \to \infty} F(x, y)$, etc.

Table 4.1. The values of U = X + Y and V = |X - Y| for the numbers X and Y showing on
two dice.

y\x |  1  |  2  |  3  |  4   |  5   |  6
1   | 2,0 | 3,1 | 4,2 | 5,3  | 6,4  | 7,5
2   | 3,1 | 4,0 | 5,1 | 6,2  | 7,3  | 8,4
3   | 4,2 | 5,1 | 6,0 | 7,1  | 8,2  | 9,3
4   | 5,3 | 6,2 | 7,1 | 8,0  | 9,1  | 10,2
5   | 6,4 | 7,3 | 8,2 | 9,1  | 10,0 | 11,1
6   | 7,5 | 8,4 | 9,3 | 10,2 | 11,1 | 12,0


By assumption, each pair of x and y values has probability 1/36, and so each
pair (u, v) of U and V values has as its probability 1/36 times the number of
boxes in which it appears. Hence, for instance, $f_{U,V}(3, 1) = P(U = 3, V = 1) =
P(X = 1, Y = 2) + P(X = 2, Y = 1) = 2/36$. Thus, the joint probability function
$f_{U,V}(u, v)$ of U and V is given by Table 4.2, with the marginal probability
function $f_U(u)$ shown as the row sums on the right margin and the marginal probability
function $f_V(v)$ shown as the column sums on the bottom margin.


Table 4.2. The joint and marginal probability functions of U = X + Y and V = |X - Y| for
the numbers X and Y showing on two dice.

u\v    | 0    | 1     | 2    | 3    | 4    | 5    | f_U(u)
2      | 1/36 | 0     | 0    | 0    | 0    | 0    | 1/36
3      | 0    | 2/36  | 0    | 0    | 0    | 0    | 2/36
4      | 1/36 | 0     | 2/36 | 0    | 0    | 0    | 3/36
5      | 0    | 2/36  | 0    | 2/36 | 0    | 0    | 4/36
6      | 1/36 | 0     | 2/36 | 0    | 2/36 | 0    | 5/36
7      | 0    | 2/36  | 0    | 2/36 | 0    | 2/36 | 6/36
8      | 1/36 | 0     | 2/36 | 0    | 2/36 | 0    | 5/36
9      | 0    | 2/36  | 0    | 2/36 | 0    | 0    | 4/36
10     | 1/36 | 0     | 2/36 | 0    | 0    | 0    | 3/36
11     | 0    | 2/36  | 0    | 0    | 0    | 0    | 2/36
12     | 1/36 | 0     | 0    | 0    | 0    | 0    | 1/36
f_V(v) | 6/36 | 10/36 | 8/36 | 6/36 | 4/36 | 2/36 | 1

Table 4.3. The values of X = max(X1, X2, X3) and Y = min(X1, X2, X3).

X1 | 1 1 1 1 1 1 | 2 2 2 2 2 2 | 3 3 3 3 3 3 | 4 4 4 4 4 4
X2 | 2 2 3 3 4 4 | 1 1 3 3 4 4 | 1 1 2 2 4 4 | 1 1 2 2 3 3
X3 | 3 4 2 4 2 3 | 3 4 1 4 1 3 | 2 4 1 4 1 2 | 2 3 1 3 1 2
X  | 3 4 3 4 4 4 | 3 4 3 4 4 4 | 3 4 3 4 4 4 | 4 4 4 4 4 4
Y  | 1 1 1 1 1 1 | 1 1 1 2 1 2 | 1 1 1 2 1 2 | 1 1 1 2 1 2
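The counting argument of Example 4.4.1, which is Equation 4.61 applied to two dice, can be sketched in Python as follows. The helper names are hypothetical, and exact fractions are used so that the output can be compared with Table 4.2 entry by entry.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def joint_pmf_of_functions(joint_xy, g, h):
    """Equation 4.61: add f_{X,Y}(x, y) over all (x, y) with g(x, y) = u, h(x, y) = v."""
    joint_uv = Counter()
    for (x, y), p in joint_xy.items():
        joint_uv[(g(x, y), h(x, y))] += p
    return dict(joint_uv)

def marginal(joint, index):
    """Sum a joint p.f. over the coordinate that is not selected by index."""
    out = Counter()
    for pair, p in joint.items():
        out[pair[index]] += p
    return dict(out)

# Two fair dice: each of the 36 outcomes (x, y) has probability 1/36.
dice = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

joint_uv = joint_pmf_of_functions(dice, lambda x, y: x + y, lambda x, y: abs(x - y))
f_u = marginal(joint_uv, 0)  # row sums of Table 4.2
f_v = marginal(joint_uv, 1)  # column sums of Table 4.2
```

Pairs (u, v) that never occur are simply absent from `joint_uv`; they correspond to the zero entries of Table 4.2.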


Example 4.4.2 (Maximum and Minimum of Three Integers). Choose three numbers
$X_1, X_2, X_3$ without replacement and with equal probabilities from the set {1, 2, 3, 4},
and let $X = \max\{X_1, X_2, X_3\}$ and $Y = \min\{X_1, X_2, X_3\}$. Find the joint probability
function of X and Y.

First, in Table 4.3 we list the set of all 24 possible outcomes, together with the
values of X and Y.

Now, each possible outcome has probability 1/24, and so we just have to count
the number of times each pair of X, Y values occurs and multiply it by 1/24 to get
the probability function f(x, y) of (X, Y). This p.f. is given in Table 4.4, together
with the marginal probabilities $f_Y(y)$ on the right and $f_X(x)$ at the bottom.


Table 4.4. The joint p.f. and marginals of X = max(X1, X2, X3) and Y = min(X1, X2, X3).

y\x   |  3  |  4  | Any x
1     | 1/4 | 1/2 | 3/4
2     | 0   | 1/4 | 1/4
Any y | 1/4 | 3/4 | 1


Example 4.4.3 (Multinomial Distribution). Suppose we have k types of objects and
we perform n independent trials of choosing one of these objects, with probabilities
$p_1, p_2, \ldots, p_k$ for the different types in each of the trials, where $p_1 + p_2 + \cdots + p_k =
1$. Let $N_1, N_2, \ldots, N_k$ denote the numbers of objects obtained in each category.
Then clearly, the joint probability function of $N_1, N_2, \ldots, N_k$ is given by

$$
f(n_1, n_2, \ldots, n_k) = P(N_1 = n_1, N_2 = n_2, \ldots, N_k = n_k)
= \binom{n}{n_1, n_2, \ldots, n_k} p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k} \tag{4.62}
$$

for every choice of nonnegative integers $n_1, n_2, \ldots, n_k$ with $n_1 + n_2 + \cdots + n_k = n$,
and $f(n_1, n_2, \ldots, n_k) = 0$ otherwise.
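Equation 4.62 is easy to evaluate exactly. The sketch below is one possible way to do so; the function name and the three category probabilities 1/2, 1/3, 1/6 are assumptions chosen only for illustration.

```python
from fractions import Fraction
from math import factorial

def multinomial_pmf(counts, probs):
    """Equation 4.62 for counts n_1..n_k and category probabilities p_1..p_k."""
    n = sum(counts)
    # Multinomial coefficient n! / (n_1! n_2! ... n_k!), kept as an integer.
    coefficient = factorial(n)
    for c in counts:
        coefficient //= factorial(c)
    value = Fraction(coefficient)
    for c, p in zip(counts, probs):
        value *= Fraction(p) ** c
    return value

# Hypothetical probabilities for k = 3 categories and n = 3 trials.
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
p_one_each = multinomial_pmf([1, 1, 1], probs)      # 3!/(1!1!1!) * (1/2)(1/3)(1/6)
p_two_one_zero = multinomial_pmf([2, 1, 0], probs)  # 3!/(2!1!0!) * (1/2)^2 (1/3)
```

The caller is expected to pass counts that are nonnegative integers; for any other argument the probability function is 0 by definition and no formula is needed.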


Next, we consider the joint distributions of continuous random variables.

Definition 4.4.3 (Joint Density Function). Let X and Y be two continuous random
variables on the same probability space. If there exists an integrable nonnegative
function f(x, y) on $\mathbb{R}^2$ such that

$$
P(a \le X \le b,\ c \le Y \le d) = \int_c^d \int_a^b f(x, y)\,dx\,dy \tag{4.63}
$$

for all real numbers a, b, c, d, then f is called the joint or bivariate probability
density function of X and Y or of the pair (X, Y), and X and Y are said to be jointly
continuous.

Similarly, for a set of n continuous random variables on the same probability
space, with n a positive integer greater than 2, if there exists an integrable
nonnegative function $f(x_1, x_2, \ldots, x_n)$ on $\mathbb{R}^n$ such that, for any coordinate rectangle¹² R of
$\mathbb{R}^n$,

$$
P((X_1, X_2, \ldots, X_n) \in R) = \int \cdots \int_R f(x_1, x_2, \ldots, x_n)\,dx_1 \cdots dx_n, \tag{4.64}
$$

then f is called the joint or multivariate probability density function of $X_1, X_2, \ldots,
X_n$ or of the point or vector $(X_1, X_2, \ldots, X_n)$, and $X_1, X_2, \ldots, X_n$ are said to be
jointly continuous.

Similarly as for discrete variables, in the continuous bivariate case $\int_{-\infty}^{\infty} f(x, y)\,dx
= f_Y(y)$ is the (marginal) density of Y, and $\int_{-\infty}^{\infty} f(x, y)\,dy = f_X(x)$ is the
(marginal) density of X. In the multivariate case, integrating the joint density over
any k of its arguments from -∞ to ∞, we get the (marginal) joint density of the
remaining n - k random variables.


The relationship between the p.d.f. and the d.f. is analogous to the one for a single
random variable: For a continuous bivariate distribution

$$
F(x, y) = P(X \le x, Y \le y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(s, t)\,ds\,dt, \tag{4.65}
$$

and

$$
f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\,\partial y}, \tag{4.66}
$$

wherever the derivative on the right-hand side exists and is continuous. Similar rela- 
tions exist for multivariate distributions. 

An important class of joint distributions is obtained by generalizing the notion of 
a uniform distribution on an interval to higher dimensions: 


Definition 4.4.4 (Uniform Distribution on Various Regions). Let D be a region of
$\mathbb{R}^n$, with n-dimensional volume V. Then the point $(X_1, X_2, \ldots, X_n)$ is said to be
chosen at random or uniformly distributed on D, if its distribution is given by the
density function¹³

$$
f(x_1, x_2, \ldots, x_n) = \begin{cases} 1/V & \text{if } (x_1, x_2, \ldots, x_n) \in D \\ 0 & \text{otherwise.} \end{cases} \tag{4.67}
$$

12 That is, a Cartesian product of n intervals; one from each coordinate axis.

Fig. 4.12. D is the shaded area.


Example 4.4.4 (Uniform Distribution on the Unit Square). Let D be the closed unit
square of $\mathbb{R}^2$, that is, D = {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}. Then the random
point (X, Y) is uniformly distributed on D, if its distribution is given by the density
function

$$
f(x, y) = \begin{cases} 1 & \text{if } (x, y) \in D \\ 0 & \text{otherwise.} \end{cases} \tag{4.68}
$$


Clearly, the marginal densities are the uniform densities on the [0, 1] intervals of 
the x and y axes, respectively. 


Example 4.4.5 (Uniform Distribution on Part of the Unit Square). Let D be the union
of the lower-left quarter and of the upper-right quarter of the unit square of $\mathbb{R}^2$, that
is, D = {(x, y) : 0 ≤ x ≤ 1/2, 0 ≤ y ≤ 1/2} ∪ {(x, y) : 1/2 ≤ x ≤ 1, 1/2 ≤ y ≤ 1},
as shown in Figure 4.12.

Then, clearly, the area of D is 1/2, and so the density function of a random point
(X, Y), uniformly distributed on D, is given by

$$
f(x, y) = \begin{cases} 2 & \text{if } (x, y) \in D \\ 0 & \text{otherwise.} \end{cases} \tag{4.69}
$$


13 Note that it makes no difference for this assignment of probabilities whether we consider
the region D open or closed or, more generally, whether we include or omit any set of
points of dimension less than n.

The surprising thing about this distribution is that the marginal densities are again
the uniform densities on the [0, 1] intervals of the x- and y-axes, just as in the
previous example, although the joint density is very different and not even continuous on
the unit square.
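The claim about the marginals in Example 4.4.5 can be verified with a short computation. The following is a sketch of that step, using the density of Equation 4.69 and the definition of the marginal density as the integral of the joint density over y; the value at the single point x = 1/2 is immaterial.

```latex
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy =
\begin{cases}
\displaystyle\int_0^{1/2} 2\,dy = 1 & \text{if } 0 \le x < 1/2 \\[2ex]
\displaystyle\int_{1/2}^{1} 2\,dy = 1 & \text{if } 1/2 < x \le 1 \\[2ex]
0 & \text{if } x < 0 \text{ or } x > 1.
\end{cases}
```

By the symmetry of D in x and y, the same computation gives the marginal density of Y.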


Example 4.4.6 (Uniform Distribution on a Diagonal of the Unit Square). Let D again
be the unit square of $\mathbb{R}^2$, that is, D = {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}, and let the
random point (X, Y) be uniformly distributed on the diagonal y = x between the
vertices (0, 0) and (1, 1), that is, on the line-segment L = {(x, y) : y = x, 0 ≤ x ≤
1}. In other words, assign probabilities to regions A in the plane by

$$
P((X, Y) \in A) = \frac{\mathrm{length}(A \cap L)}{\sqrt{2}}. \tag{4.70}
$$

Clearly, again, the marginal densities are the uniform densities on the [0, 1]
intervals of the x- and y-axes, respectively. Note, however, that X and Y are not jointly
continuous (nor discrete) and do not have a joint density function, in spite of X and
Y being continuous separately.


Example 4.4.7 (Uniform Distribution on the Unit Disc). Let D be the unit disc of
$\mathbb{R}^2$, that is, $D = \{(x, y) : x^2 + y^2 < 1\}$. Then the random point (X, Y) is uniformly
distributed on D, if its distribution is given by the density function

$$
f(x, y) = \begin{cases} 1/\pi & \text{if } (x, y) \in D \\ 0 & \text{otherwise.} \end{cases} \tag{4.71}
$$

The marginal density of X is obtained from its definition $f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$.
Now, for any fixed $x \in (-1, 1)$, $f(x, y) \ne 0$ if and only if $-\sqrt{1 - x^2} < y <
\sqrt{1 - x^2}$ and so, for such x,

$$
\int_{-\infty}^{\infty} f(x, y)\,dy = \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} \frac{1}{\pi}\,dy = \frac{2}{\pi}\sqrt{1 - x^2}. \tag{4.72}
$$

Thus,

$$
f_X(x) = \begin{cases} (2/\pi)\sqrt{1 - x^2} & \text{if } x \in (-1, 1) \\ 0 & \text{otherwise.} \end{cases} \tag{4.73}
$$

By symmetry, the marginal density of Y is the same, just with x replaced by y:

$$
f_Y(y) = \begin{cases} (2/\pi)\sqrt{1 - y^2} & \text{if } y \in (-1, 1) \\ 0 & \text{otherwise.} \end{cases} \tag{4.74}
$$


Frequently, as for single random variables, we know the general form of a joint
distribution except for an unknown coefficient, which we determine from the
requirement that the total probability must be 1.

Fig. 4.13. The range of x for a given y.


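Example 4.4.8 (A Distribution on a Triangle). Let D be the triangle in $\mathbb{R}^2$ given by
D = {(x, y) : 0 ≤ x, 0 ≤ y, x + y ≤ 1}, and let (X, Y) have the density function

$$
f(x, y) = \begin{cases} Cxy^2 & \text{if } (x, y) \in D \\ 0 & \text{otherwise.} \end{cases} \tag{4.75}
$$

Find the value of C and compute the probability P(X < Y).
By Figure 4.13,

$$
\begin{aligned}
1 &= \iint_{\mathbb{R}^2} f(x, y)\,dx\,dy = \iint_D Cxy^2\,dx\,dy = \int_0^1 \int_0^{1-y} Cxy^2\,dx\,dy \\
&= C \int_0^1 \frac{1}{2}(1 - y)^2 y^2\,dy = C \int_0^1 \frac{1}{2}\left(y^2 - 2y^3 + y^4\right) dy \\
&= \frac{C}{2}\left(\frac{1}{3} - \frac{1}{2} + \frac{1}{5}\right) = \frac{C}{60}.
\end{aligned} \tag{4.76}
$$

Thus C = 60.

To compute the probability P(X < Y) we have to integrate f over those values
(x, y) of (X, Y) for which x < y holds, that is, for the half of the triangle D above
the y = x line. (See Fig. 4.14.) Thus

$$
\begin{aligned}
P(X < Y) &= 60 \int_0^{1/2} \int_x^{1-x} xy^2\,dy\,dx = 60 \int_0^{1/2} x \left[\frac{y^3}{3}\right]_x^{1-x} dx \\
&= 20 \int_0^{1/2} x\left[(1 - x)^3 - x^3\right] dx = 20 \int_0^{1/2} \left(x - 3x^2 + 3x^3 - 2x^4\right) dx \\
&= 20 \left[\frac{1}{2}\left(\frac{1}{2}\right)^2 - \left(\frac{1}{2}\right)^3 + \frac{3}{4}\left(\frac{1}{2}\right)^4 - \frac{2}{5}\left(\frac{1}{2}\right)^5\right] = \frac{11}{16}.
\end{aligned} \tag{4.77}
$$

Fig. 4.14. The integration limits for P(X < Y).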
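The two integrals of Example 4.4.8 can be cross-checked numerically. The sketch below uses a simple midpoint rule on an n × n grid over the unit square, with grid cells that are cut in half by the line x + y = 1 or by the line y = x given weight 1/2. The grid size n = 200 and all function names are assumptions made for the illustration; this is a numerical check, not the method of the text.

```python
def weight_triangle(i, j, n):
    """Fraction of grid cell (i, j) inside D = {x >= 0, y >= 0, x + y <= 1}."""
    s = i + j + 1
    return 1.0 if s < n else (0.5 if s == n else 0.0)

def weight_triangle_x_less_y(i, j, n):
    """Fraction of grid cell (i, j) inside D and above the line y = x."""
    above = 1.0 if i < j else (0.5 if i == j else 0.0)
    return weight_triangle(i, j, n) * above

def integrate_on_grid(f, weight, n=200):
    """Midpoint-rule approximation of the integral of f over the weighted cells."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            w = weight(i, j, n)
            if w:
                y = (j + 0.5) * h
                total += w * f(x, y)
    return total * h * h

def density_shape(x, y):
    """The density of Equation 4.75 without the constant C."""
    return x * y * y

C = 1.0 / integrate_on_grid(density_shape, weight_triangle)
p_x_less_y = C * integrate_on_grid(density_shape, weight_triangle_x_less_y)
```

With these settings `C` should come out close to 60 and `p_x_less_y` close to 11/16 = 0.6875, in agreement with Equations 4.76 and 4.77.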


The second part of the above example is an instance of the following general
principle: If (X, Y) is continuous with joint p.d.f. f, and A is any set¹⁴ in $\mathbb{R}^2$, then

$$
P((X, Y) \in A) = \iint_A f(x, y)\,dx\,dy. \tag{4.78}
$$

In particular, if the set A is defined by a function g so that A = {(x, y) : g(x, y) ≤ a},
for some constant a, then

$$
P(g(X, Y) \le a) = \iint_{\{g(x, y) \le a\}} f(x, y)\,dx\,dy. \tag{4.79}
$$

Relations similar to Equations 4.78 and 4.79 hold for discrete random variables
as well; we just have to replace the integrals by sums.

Equation 4.79 shows how to obtain the d.f. of a new random variable Z = 
g(X, Y). This is illustrated in the following example. 


Example 4.4.9 (Distribution of the Sum of the Coordinates of a Point). Let the
random point (X, Y) be uniformly distributed on the unit square D = {(x, y) : 0 ≤ x ≤
1, 0 ≤ y ≤ 1}, as in Example 4.4.4. Find the d.f. of Z = X + Y.

By Equation 4.79 (see Fig. 4.15),

$$
\begin{aligned}
F_Z(z) &= P(X + Y \le z) = \iint_{\{x + y \le z\}} f(x, y)\,dx\,dy = \iint_{\{x + y \le z\} \cap D} dx\,dy \\
&= \text{Area of } D \text{ under the line } x + y = z \\
&= \begin{cases} 0 & \text{if } z < 0 \\ z^2/2 & \text{if } 0 \le z < 1 \\ 1 - (2 - z)^2/2 & \text{if } 1 \le z < 2 \\ 1 & \text{if } 2 \le z, \end{cases}
\end{aligned} \tag{4.80}
$$

and so the p.d.f. of Z is

$$
f_Z(z) = F_Z'(z) = \begin{cases} 0 & \text{if } z < 0 \\ z & \text{if } 0 \le z < 1 \\ 2 - z & \text{if } 1 \le z < 2 \\ 0 & \text{if } 2 \le z. \end{cases} \tag{4.81}
$$

14 More precisely, A is any set in $\mathbb{R}^2$ such that {s : (X(s), Y(s)) ∈ A} is an event.

Fig. 4.15. The region {x + y ≤ z} ∩ D, depending on the value of z.
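A small simulation can be used to compare Equation 4.80 with the observed fraction of random points of the unit square that fall under the line x + y = z. The sample size, the seed and the function names below are assumptions made for the sketch.

```python
import random

def cdf_z(z):
    """Distribution function of Z = X + Y, Equation 4.80."""
    if z < 0:
        return 0.0
    if z < 1:
        return z * z / 2.0
    if z < 2:
        return 1.0 - (2.0 - z) ** 2 / 2.0
    return 1.0

def empirical_cdf_of_sum(z, n=50000, seed=1):
    """Fraction of n simulated points (X, Y) of the unit square with X + Y <= z."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    hits = sum(rng.random() + rng.random() <= z for _ in range(n))
    return hits / n

# Triples (z, value from Equation 4.80, simulated fraction).
comparison = [(z, cdf_z(z), empirical_cdf_of_sum(z)) for z in (0.5, 1.0, 1.5)]
```

The simulated fractions fluctuate around the exact values 1/8, 1/2 and 7/8, with a spread that shrinks as the sample size grows.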


Exercises 


Exercise 4.4.1. Roll two dice as in Example 4.4.1. Find the joint probability function
of U = X + Y and V = X - Y.


Exercise 4.4.2. Roll two dice as in Example 4.4.1. Find the joint probability function 
of U = max(X, Y) and V = min(X, Y). 


Exercise 4.4.3. Roll six dice. Find the probabilities of obtaining 


1. each of the six possible numbers once, 
2. one 1, two 2’s, and three 3’s. 


Exercise 4.4.4. Let the random point (X, Y) be uniformly distributed on the triangle
D = {(x, y) : 0 ≤ x ≤ y ≤ 1}. Find the marginal densities of X and Y and plot their
graphs.

Exercise 4.4.5. Let the random point (X, Y) be uniformly distributed on the unit
disc $D = \{(x, y) : x^2 + y^2 < 1\}$. Find the d.f. and the p.d.f. of the point's distance
$Z = \sqrt{X^2 + Y^2}$ from the origin.

Exercise 4.4.6. Let (X, Y) be continuous with density f(x, y) = Ce~*~?) for x >
0, y > 0 and 0 otherwise. Find


1. the value of the constant C, 

2. the marginal densities of X and Y, 
3. the joint d.f. F(x, y), 

4. P(X <Y). 


Exercise 4.4.7. Let (X, Y) be continuous with density $f(x, y) = Cxy^2$ on the
triangle D = {(x, y) : 0 ≤ x ≤ y ≤ 1} and 0 otherwise. Find

1. the value of the constant C,
2. the marginal densities of X and Y,
3. the joint d.f. F(x, y),
4. $P(X > Y^2)$.


Exercise 4.4.8. Let the random point (X, Y) be uniformly distributed on the square
D = {(x, y) : -1 ≤ x ≤ 1, -1 ≤ y ≤ 1}. Find the d.f. and the p.d.f. of Z = X + Y.


Exercise 4.4.9. Show that, for any random variables X and Y and any real numbers
$x_1 < x_2$ and $y_1 < y_2$,

$$
P(x_1 < X \le x_2,\ y_1 < Y \le y_2) = F(x_2, y_2) - F(x_1, y_2) + F(x_1, y_1) - F(x_2, y_1).
$$


4.5 Independence of Random Variables 


The notion of the independence of events can easily be extended to random variables,
by applying the product rule to their joint distributions.


Definition 4.5.1 (Independence of Two Random Variables). Two random vari- 
ables X and Y are said to be independent of each other if, for all intervals A 
and B, 


P(X € A,Y € B)=P(X € A)P(Y € B). (4.82) 
Equivalently, we can reformulate the defining condition in terms of F or f: 


Theorem 4.5.1 (Alternative Conditions for Independence of Two Random
Variables). Two random variables X and Y are independent of each other if and only if
their joint d.f. is the product of their marginal d.f.'s:

$$
F(x, y) = F_X(x) F_Y(y) \quad \text{for all } x, y. \tag{4.83}
$$

Two discrete or absolutely continuous random variables X and Y are independent of
each other if and only if their joint p.f. or p.d.f. is the product of their marginal p.f.'s
or p.d.f.'s:

$$
f(x, y) = f_X(x) f_Y(y) \quad \text{for all } x, y. \tag{4.84}
$$

Table 4.5. The joint p.f. and marginals of two discrete dependent random variables.

y\x   |  3  |  4  | Any x
1     | 1/4 | 1/2 | 3/4
2     | 0   | 1/4 | 1/4
Any y | 1/4 | 3/4 | 1


Proof. If in Definition 4.5.1 we choose A = (-∞, x] and B = (-∞, y], then we
get Equation 4.83. Conversely, if Equation 4.83 holds, then Equation 4.82 follows
for any intervals from Theorem 4.1.2.

For discrete variables, Equation 4.84 follows from Definition 4.5.1 by
substituting the one-point intervals A = [x, x] and B = [y, y], and for continuous
variables by differentiating Equation 4.83. Conversely, we can obtain Equation 4.83 from
Equation 4.84 by summation or integration. □


Example 4.5.1 (Two Discrete Examples). In Example 4.4.2 we obtained Table 4.5 for
the joint p.f. f and the marginals of two discrete random variables X and Y.

These variables are not independent, since $f(x, y) = f_X(x) f_Y(y)$ does not hold
for all x, y. For instance, f(3, 1) = 1/4 but $f_X(3) f_Y(1) = (1/4) \cdot (3/4) = 3/16$. (Note
that we need to establish only one instance of $f(x, y) \ne f_X(x) f_Y(y)$ to disprove
independence, but to prove independence we need to show $f(x, y) = f_X(x) f_Y(y)$ for all
x, y.)

We can easily construct a table for the f, with the same x, y values and the same
marginals, that represents the distribution of independent X and Y. All we have to
do is to make each entry f(x, y) equal to the product of the corresponding numbers
in the margins. (See Table 4.6.)


These examples show that there are usually many possible joint distributions for 
the given marginals, but only one of those represents independent random variables. 
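
The comparison made in Example 4.5.1 can be sketched in a few lines of Python. This is only an illustration: the helper names `marginals` and `is_independent` and the dictionary layout keyed by (x, y) are assumptions made here, not notation from the text. Exact fractions are used so that the product rule of Equation 4.84 can be tested with equality.

```python
from fractions import Fraction as F

def marginals(joint):
    """Return (f_X, f_Y) for a joint p.f. given as {(x, y): probability}."""
    fx, fy = {}, {}
    for (x, y), p in joint.items():
        fx[x] = fx.get(x, 0) + p
        fy[y] = fy.get(y, 0) + p
    return fx, fy

def is_independent(joint):
    """True if f(x, y) = f_X(x) f_Y(y) for every pair (Equation 4.84)."""
    fx, fy = marginals(joint)
    return all(joint.get((x, y), 0) == fx[x] * fy[y] for x in fx for y in fy)

# Entries of Table 4.5 (dependent) and Table 4.6 (independent), keyed by (x, y).
table_4_5 = {(3, 1): F(1, 4), (4, 1): F(1, 2), (3, 2): F(0), (4, 2): F(1, 4)}
table_4_6 = {(3, 1): F(3, 16), (4, 1): F(9, 16), (3, 2): F(1, 16), (4, 2): F(3, 16)}

print(is_independent(table_4_5))                      # False
print(is_independent(table_4_6))                      # True
print(marginals(table_4_5) == marginals(table_4_6))   # True: same marginals
```

The last line illustrates the remark above: the two tables share their marginals, yet only one of them satisfies the product rule.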


Example 4.5.2 (Independent Uniform Random Variables). Let the random point
(X, Y) be uniformly distributed on the rectangle
$D = \{(x, y) : a \le x \le b,\ c \le y \le d\}$. Then


Table 4.6. The joint p.f. and marginals of two discrete independent random variables.


y\x   |  3   |  4   | Any x
1     | 3/16 | 9/16 |  3/4
2     | 1/16 | 3/16 |  1/4
Any y | 1/4  | 3/4  |   1
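
$$f(x, y) = \begin{cases} \dfrac{1}{(b - a)(d - c)} & \text{if } (x, y) \in D \\ 0 & \text{otherwise} \end{cases} \tag{4.85}$$

and the marginal densities are obtained by integration as

$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \begin{cases} \displaystyle\int_c^d \frac{dy}{(b - a)(d - c)} = \frac{1}{b - a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases} \tag{4.86}$$

and

$$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx = \begin{cases} \displaystyle\int_a^b \frac{dx}{(b - a)(d - c)} = \frac{1}{d - c} & \text{if } c \le y \le d \\ 0 & \text{otherwise.} \end{cases} \tag{4.87}$$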


Hence X and Y are uniformly distributed on their respective intervals and are
independent, since $f(x, y) = f_X(x) f_Y(y)$ for all x, y, as the preceding formulas
show.

Clearly, the converse of our result is also true: If X and Y are uniformly distributed
on their respective intervals and are independent, then $f_X(x) f_Y(y)$ yields
the p.d.f. 4.85 of a point (X, Y) uniformly distributed on the corresponding rectangle.


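Example 4.5.3 (Uniform (X, Y) on the Unit Disc). Let the random point (X, Y) be
uniformly distributed on the unit disc $D = \{(x, y) : x^2 + y^2 < 1\}$. In Example 4.4.7
we obtained

$$f(x, y) = \begin{cases} 1/\pi & \text{if } (x, y) \in D \\ 0 & \text{otherwise,} \end{cases} \tag{4.88}$$

$$f_X(x) = \begin{cases} \dfrac{2}{\pi}\sqrt{1 - x^2} & \text{if } x \in (-1, 1) \\ 0 & \text{otherwise,} \end{cases} \tag{4.89}$$

and

$$f_Y(y) = \begin{cases} \dfrac{2}{\pi}\sqrt{1 - y^2} & \text{if } y \in (-1, 1) \\ 0 & \text{otherwise.} \end{cases} \tag{4.90}$$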


Now, clearly, $f(x, y) = f_X(x) f_Y(y)$ does not hold for all x, y, and so X and Y
are not independent.

Note that this result is in agreement with the nontechnical meaning of dependence:
From the shape of the disc, it follows that some values of X more or less
determine the corresponding values of Y (and vice versa). For instance, if X is close
to ±1, then Y must be close to 0, and so X and Y are not expected to be independent
of each other.
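
As a concrete check of the failure of the product rule (a worked instance added here for illustration, using only the densities of the unit-disc example), one may evaluate both sides at the center of the disc and at a point of the square $(-1, 1) \times (-1, 1)$ that lies outside the disc:

$$f(0, 0) = \frac{1}{\pi} \approx 0.318, \qquad f_X(0)\,f_Y(0) = \frac{2}{\pi} \cdot \frac{2}{\pi} = \frac{4}{\pi^2} \approx 0.405,$$

$$f(0.9, 0.9) = 0, \qquad f_X(0.9)\,f_Y(0.9) = \frac{4}{\pi^2}\left(1 - 0.81\right) \approx 0.077 > 0.$$

Either point alone suffices to disprove independence.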


Fig. 4.16. 


Example 4.5.4 (Constructing a Triangle). Suppose we pick two random points X 
and Y independently and uniformly on the interval [0, 1]. What is the probability 
that we can construct a triangle from the resulting three segments as its sides? 

A triangle can be constructed if and only if the sum of any two sides is longer 
than the third side. In our case, this condition means that each side must be shorter 
than 1/2. (Prove this!) Thus X and Y must satisfy either 


$$0 < X < \frac{1}{2}, \quad 0 < Y - X < \frac{1}{2}, \quad \frac{1}{2} < Y < 1, \tag{4.91}$$

or

$$0 < Y < \frac{1}{2}, \quad 0 < X - Y < \frac{1}{2}, \quad \frac{1}{2} < X < 1. \tag{4.92}$$


By Example 4.5.2 the given selection of the two points X and Y on a line is 
equivalent to the selection of the single point (X, Y) with a uniform distribution 
on the unit square of the plane. Now, the two sets of inequalities describe the two 
triangles at the center, shown in Fig. 4.16, and the required probability is their area: 
1/4. 
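
The value 1/4 can be checked by simulation. The following is a minimal Monte Carlo sketch, not part of the original example: the function names, the number of trials, and the fixed seed are arbitrary choices made for illustration.

```python
import random

def can_form_triangle(x, y):
    """The three sides are the pieces of [0, 1] cut at the points x and y."""
    a, b = min(x, y), max(x, y)
    sides = (a, b - a, 1.0 - b)
    # Equivalent to the triangle inequalities: every piece is shorter than 1/2.
    return max(sides) < 0.5

def estimate_triangle_probability(trials=200_000, seed=0):
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    hits = sum(can_form_triangle(rng.random(), rng.random()) for _ in range(trials))
    return hits / trials

print(estimate_triangle_probability())  # should be close to 0.25
```

With 200,000 trials the standard error of the estimate is about 0.001, so the printed value should agree with 1/4 to roughly two decimal places.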


Next, we present some theorems about independence of random variables. 


Theorem 4.5.2 (A Constant is Independent of Any Random Variable). Let X = 
c, where c is any constant, and let Y be any r.v. Then X and Y are independent. 


Proof. Let X = c, and let Y be any r.v. Then Equation 4.83 becomes

$$P(c \le x, Y \le y) = P(c \le x)\,P(Y \le y), \tag{4.93}$$

and this equation is true because for $x \ge c$ and any y it reduces to
$P(Y \le y) = P(Y \le y)$, and for $x < c$ it reduces to 0 = 0. ∎


Theorem 4.5.3 (No Nonconstant Random Variable is Independent of Itself). Let 
X be any nonconstant random variable and let Y = X. Then X and Y are dependent.


Proof. Let A and B be two disjoint intervals for which $P(X \in A) > 0$ and
$P(X \in B) > 0$ hold. Since X is not constant, such intervals clearly exist. If Y = X, then
$P(X \in A, Y \in B) = 0$, but $P(X \in A)\,P(Y \in B) > 0$, and so Equation 4.82 does not
hold for all intervals A and B. ∎


Theorem 4.5.4 (Independence of Functions of Random Variables). Let X and Y 
be independent random variables, and let g and h be any real-valued measurable 
functions (see the footnote on page 93) on Range(X) and Range(Y), respectively. 
Then g(X) and h(Y) are independent. 


Proof. We give the proof for discrete X and Y only. Let A and B be arbitrary inter- 
vals. Then 


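$$\begin{aligned}
P(g(X) \in A, h(Y) \in B) &= \sum_{\{x : g(x) \in A\}} \sum_{\{y : h(y) \in B\}} P(X = x, Y = y) \\
&= \sum_{\{x : g(x) \in A\}} \sum_{\{y : h(y) \in B\}} P(X = x)\,P(Y = y) \\
&= \sum_{\{x : g(x) \in A\}} P(X = x) \sum_{\{y : h(y) \in B\}} P(Y = y) \\
&= P(g(X) \in A)\,P(h(Y) \in B).
\end{aligned} \tag{4.94}$$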


We can extend the definition of independence to several random variables as well, 
but we need to distinguish different types of independence, depending on the number 
of variables involved: 


Definition 4.5.2 (Independence of Several Random Variables). Let $X_1, X_2, \ldots, X_n$,
for n = 2, 3, ..., be arbitrary random variables.
They are (totally) independent if

$$P(X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n) = P(X_1 \in A_1)\,P(X_2 \in A_2) \cdots P(X_n \in A_n) \tag{4.95}$$

for all intervals $A_1, A_2, \ldots, A_n$.
They are pairwise independent if

$$P(X_i \in A_i, X_j \in A_j) = P(X_i \in A_i)\,P(X_j \in A_j) \tag{4.96}$$

for all $i \ne j$ and all intervals $A_i$, $A_j$.


Note that in the case of total independence, it is not necessary to require the prod- 
uct rule for all subsets of the n random variables (as we had to for general events), 
because the product rule for any number less than n follows from Equation 4.95 by 
setting $A_i = \mathbb{R}$ for all values of i that we want to omit. On the other hand, pairwise
independence is a weaker requirement than total independence: Equation 4.96 does 
not imply Equation 4.95. Also, we could have defined various types of independence 
between total and pairwise, but such types generally do not occur in practice. 

We have the following theorems for several random variables, analogous to The- 
orem 4.5.1 and Theorem 4.5.4, which we state without proof. 
Theorem 4.5.5 (Alternative Conditions for Independence of Several Random
Variables). Any random variables $X_1, X_2, \ldots, X_n$, for n = 2, 3, ..., are independent
of each other if and only if their joint d.f. is the product of their marginal d.f.'s:


$$F(x_1, x_2, \ldots, x_n) = F_1(x_1)\,F_2(x_2) \cdots F_n(x_n) \quad \text{for all } x_1, x_2, \ldots, x_n. \tag{4.97}$$


Also, any discrete or absolutely continuous random variables $X_1, X_2, \ldots, X_n$,
for n = 2, 3, ..., are independent of each other if and only if their joint p.f. or p.d.f.
is the product of their marginal p.f.'s or p.d.f.'s:

$$f(x_1, x_2, \ldots, x_n) = f_1(x_1)\,f_2(x_2) \cdots f_n(x_n) \quad \text{for all } x_1, x_2, \ldots, x_n. \tag{4.98}$$


Theorem 4.5.6 (Independence of Functions of Random Variables). Let $X_1, X_2, \ldots, X_n$,
for n = 2, 3, ..., be independent random variables, and let the $g_i$ be
real-valued measurable functions on Range($X_i$) for i = 1, 2, ..., n. Then
$g_1(X_1), g_2(X_2), \ldots, g_n(X_n)$ are independent.


Theorem 4.5.6 could be further generalized in an obvious way by taking the $g_i$ to
be functions of several, nonoverlapping variables. For example, in the case of three 
random variables, we have the following theorem: 


Theorem 4.5.7 (Independence of g(X, Y) and Z). If Z is independent of (X,Y), 
then Z is independent of g(X, Y), too, for any measurable function g. 


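Proof. We give the proof for jointly continuous X, Y and Z only.
For arbitrary t and z,

$$\begin{aligned}
P(g(X, Y) \le t, Z \le z) &= \int_{-\infty}^{z} \iint_{g(x, y) \le t} f(x, y, \zeta)\,dx\,dy\,d\zeta \\
&= \int_{-\infty}^{z} \iint_{g(x, y) \le t} f_{X,Y}(x, y)\,f_Z(\zeta)\,dx\,dy\,d\zeta \\
&= \iint_{g(x, y) \le t} f_{X,Y}(x, y)\,dx\,dy \int_{-\infty}^{z} f_Z(\zeta)\,d\zeta \\
&= P(g(X, Y) \le t)\,P(Z \le z).
\end{aligned} \tag{4.99}$$

By Theorem 4.5.1, Equation 4.99 proves the independence of g(X, Y) and Z. ∎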


In some applications, we need to find the distribution of the maximum or the 
minimum of several independent random variables. This can be done as follows: 


Theorem 4.5.8 (Distribution of Maximum and Minimum of Several Random
Variables). Let $X_1, X_2, \ldots, X_n$, for n = 2, 3, ..., be independent, identically
distributed (abbreviated i.i.d.) random variables with common d.f. $F_X$, and let
$Y = \max\{X_1, X_2, \ldots, X_n\}$ and $Z = \min\{X_1, X_2, \ldots, X_n\}$.¹⁵ Then the distribution
functions of Y and Z are given by

$$F_Y(y) = [F_X(y)]^n \quad \text{for all } y \in \mathbb{R} \tag{4.100}$$

and

$$F_Z(z) = 1 - [1 - F_X(z)]^n \quad \text{for all } z \in \mathbb{R}. \tag{4.101}$$


15 Note that the max and the min must be taken pointwise, that is, for each sample point s we
must consider the max and the min of $\{X_1(s), X_2(s), \ldots, X_n(s)\}$, and so Y and Z will in
general be different from each of the $X_i$.

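Proof. For any $y \in \mathbb{R}$, $Y = \max\{X_1, X_2, \ldots, X_n\} \le y$ holds if and only if, for
every i, $X_i \le y$. Thus, we have

$$\begin{aligned}
F_Y(y) &= P(X_1 \le y, X_2 \le y, \ldots, X_n \le y) \\
&= P(X_1 \le y)\,P(X_2 \le y) \cdots P(X_n \le y) = [F_X(y)]^n.
\end{aligned} \tag{4.102}$$

Similarly,

$$\begin{aligned}
F_Z(z) &= P(Z \le z) = 1 - P(Z > z) \\
&= 1 - P(X_1 > z, X_2 > z, \ldots, X_n > z) \\
&= 1 - P(X_1 > z)\,P(X_2 > z) \cdots P(X_n > z) \\
&= 1 - [1 - F_X(z)]^n.
\end{aligned} \tag{4.103}$$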
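
A simulation can make the two formulas of Theorem 4.5.8 tangible. The sketch below is only an illustration under assumptions chosen here (uniform variables on [0, 1], n = 5, evaluation points 0.7 and 0.2, a fixed seed); the function name `empirical_max_min_df` is hypothetical.

```python
import random

def empirical_max_min_df(n, y, z, trials=100_000, seed=1):
    """Estimate P(max <= y) and P(min <= z) for n i.i.d. uniform [0, 1] draws."""
    rng = random.Random(seed)
    max_hits = min_hits = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        max_hits += max(xs) <= y
        min_hits += min(xs) <= z
    return max_hits / trials, min_hits / trials

def uniform_df(t):
    """d.f. of the uniform distribution on [0, 1]."""
    return min(max(t, 0.0), 1.0)

n, y, z = 5, 0.7, 0.2
exact_max = uniform_df(y) ** n            # Equation 4.100
exact_min = 1 - (1 - uniform_df(z)) ** n  # Equation 4.101
est_max, est_min = empirical_max_min_df(n, y, z)
print(exact_max, est_max)
print(exact_min, est_min)
```

Each printed pair shows the exact value from the theorem next to its empirical estimate; the two should agree to about two decimal places.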


Example 4.5.5 (Maximum of Two Independent Uniformly Distributed Points). Let $X_1$
and $X_2$ be independent, uniform random variables on the interval [0, 1]. Find the d.f.
and the p.d.f. of $Y = \max\{X_1, X_2\}$.

By Equation 4.16,

$$F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } 0 \le x < 1 \\ 1 & \text{if } x \ge 1, \end{cases} \tag{4.104}$$

and so, by Theorem 4.5.8,

$$F_Y(y) = \begin{cases} 0 & \text{if } y < 0 \\ y^2 & \text{if } 0 \le y < 1 \\ 1 & \text{if } y \ge 1. \end{cases} \tag{4.105}$$

Hence the p.d.f. of Y is given by

$$f_Y(y) = \begin{cases} 2y & \text{if } 0 \le y < 1 \\ 0 & \text{if } y < 0 \text{ or } y \ge 1, \end{cases} \tag{4.106}$$

which shows that the probability of $Y = \max\{X_1, X_2\}$ falling in a subinterval of
length dy is no longer constant over [0, 1], as for $X_1$ or $X_2$, but increases linearly.
The two functions above can also be seen in Figure 4.17 below. The sample
space is the set of points $s = (x_1, x_2)$ of the unit square and, for any sample point s,
$X_1(s) = x_1$ and $X_2(s) = x_2$. The sample points are uniformly distributed on the unit
square, and so the areas of subsets give the corresponding probabilities. Since for any
sample point s above the diagonal $x_1 < x_2$ holds, $Y(s) = x_2$ there and, similarly,
below the diagonal $Y(s) = x_1$. Thus, the set $\{s : Y(s) \le y\}$ is the shaded square of
area $y^2$, and the thin strip of width dy, to the right and above the square, has area
$\sim 2y\,dy$. ◊

Fig. 4.17. The d.f. and the p.d.f. of $Y = \max\{X_1, X_2\}$ for two i.i.d. uniform r.v.'s on [0, 1].


Another, very important function of two independent random variables is their 
sum. We have the following theorem for its distribution: 


Theorem 4.5.9 (Sum of Two Independent Random Variables). Let X and Y be
independent random variables and Z = X + Y. If X and Y are discrete, then the p.f.
of Z is given by

$$f_Z(z) = \sum_{x + y = z} f_X(x)\,f_Y(y) = \sum_{x} f_X(x)\,f_Y(z - x), \tag{4.107}$$

where, for a given z, the summation is extended over all possible values of X and
Y for which x + y = z, if such values exist. Otherwise $f_Z(z)$ is taken to be 0. The
expression on the right-hand side is called the convolution of $f_X$ and $f_Y$.

If X and Y are continuous with densities $f_X$ and $f_Y$, then the density of
Z = X + Y is given by

$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\,f_Y(z - x)\,dx, \tag{4.108}$$

where the integral is again called the convolution of $f_X$ and $f_Y$.


Proof. In the discrete case, Equation 4.107 is obvious.

In the continuous case, Z falls between z and z + dz if and only if the point (X, Y)
falls in the oblique strip between the lines x + y = z and x + y = z + dz, shown
in Figure 4.18. The area of the shaded parallelogram is dx dz and the probability of
(X, Y) falling into it is¹⁶

$$P(x \le X < x + dx,\ z \le Z < z + dz) \sim f(x, y)\,dx\,dz = f_X(x)\,f_Y(z - x)\,dx\,dz. \tag{4.109}$$


Fig. 4.18. The probability of (X, Y) falling in the oblique strip is dz times the convolution.


Hence the probability of the strip is obtained by integrating over all x as

$$P(z \le Z < z + dz) \sim \left[\int_{-\infty}^{\infty} f_X(x)\,f_Y(z - x)\,dx\right] dz, \tag{4.110}$$

and, since $P(z \le Z < z + dz) \sim f_Z(z)\,dz$, Equation 4.110 implies Equation 4.108. ∎


The convolution formulas for two special classes of random variables are worth 
mentioning separately: 


Corollary 4.5.1. If the possible values of X and Y are the natural numbers i, j =
0, 1, 2, ..., then the p.f. of Z = X + Y is given by

$$f_Z(k) = \sum_{i=0}^{k} f_X(i)\,f_Y(k - i) \quad \text{for } k = 0, 1, 2, \ldots, \tag{4.111}$$

and if X and Y are continuous nonnegative random variables, then the p.d.f. of
Z = X + Y is given by

$$f_Z(z) = \int_0^z f_X(x)\,f_Y(z - x)\,dx. \tag{4.112}$$


16 Recall that the symbol ~ means that the ratio of the expressions on either side of it tends
to 1 as dx and dz tend to 0.
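
To see Equation 4.112 at work numerically, the following sketch approximates the convolution integral by the midpoint rule for two independent uniform variables on [0, 1]. The choice of distribution, the function names, and the step count are assumptions made for illustration. For this case the sum is known to have the triangular density z on [0, 1] and 2 − z on [1, 2].

```python
def convolve_on_interval(fx, fy, z, steps=10_000):
    """Midpoint-rule approximation of the integral of fx(x) * fy(z - x) over [0, z]."""
    if z <= 0:
        return 0.0
    h = z / steps
    return h * sum(fx((i + 0.5) * h) * fy(z - (i + 0.5) * h) for i in range(steps))

def uniform01(t):
    """Density of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= t <= 1.0 else 0.0

for z in (0.5, 1.0, 1.5):
    print(z, convolve_on_interval(uniform01, uniform01, z))
```

The printed values should be close to 0.5, 1.0 and 0.5, matching the triangular density at those points.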
Example 4.5.6 (Sum of Two Binomial Random Variables). Let X and Y be independent,
binomial r.v.'s with parameters $n_1, p$ and $n_2, p$, respectively. Then Z = X + Y
is binomial with parameters $n_1 + n_2, p$ since, by Equation 4.111 and Equation 2.33,
writing $q = 1 - p$,

$$\begin{aligned}
f_Z(k) &= \sum_{i=0}^{k} \binom{n_1}{i} p^i q^{n_1 - i} \binom{n_2}{k - i} p^{k - i} q^{n_2 - k + i} \\
&= \sum_{i=0}^{k} \binom{n_1}{i} \binom{n_2}{k - i} p^k q^{n_1 + n_2 - k} \\
&= \binom{n_1 + n_2}{k} p^k q^{n_1 + n_2 - k} \quad \text{for } k = 0, 1, 2, \ldots, n_1 + n_2.
\end{aligned} \tag{4.113}$$

This result should be obvious even without any computation, since X counts the
number of successes in $n_1$ independent trials and Y the number of successes in $n_2$
trials, independent of each other and of the first $n_1$ trials, and so Z = X + Y counts
the number of successes in $n_1 + n_2$ independent trials, all with the same probability p.

On the other hand, for sampling without replacement, the trials are not independent,
and the analogous sum of two independent hypergeometric random variables
does not turn out to be hypergeometric.
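
The binomial identity of Example 4.5.6 can also be confirmed by carrying out the discrete convolution of Equation 4.111 directly. The sketch below uses exact fractions; the parameter values p = 1/3, n_1 = 4, n_2 = 6 and the function names are arbitrary choices made here for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_pf(n, p):
    """p.f. of a binomial(n, p) variable as a list indexed by k = 0, ..., n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def convolve(fx, fy):
    """Equation 4.111 for p.f.'s given as lists indexed from 0."""
    fz = [0] * (len(fx) + len(fy) - 1)
    for i, a in enumerate(fx):
        for j, b in enumerate(fy):
            fz[i + j] += a * b
    return fz

p = Fraction(1, 3)   # illustrative parameter values
n1, n2 = 4, 6
f_Z = convolve(binomial_pf(n1, p), binomial_pf(n2, p))
print(f_Z == binomial_pf(n1 + n2, p))  # True
print(sum(f_Z))                        # 1
```

Because the arithmetic is exact, the comparison is an equality test rather than an approximate one.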


Exercises 


Exercise 4.5.1. Two cards are dealt from a regular deck of 52 cards without replace- 
ment. Let X denote the number of spades and Y the number of hearts obtained. Are 
X and Y independent? 


Exercise 4.5.2. We roll two dice once. Let X denote the number of 1’s and Y the 
number of 6’s obtained. Are X and Y independent? 


Exercise 4.5.3. Let the random point (X, Y) be uniformly distributed on
$D = \{(x, y) : 0 \le x \le 1/2,\ 0 \le y \le 1/2\} \cup \{(x, y) : 1/2 \le x \le 1,\ 1/2 \le y \le 1\}$
as in Example 4.4.5. Are X and Y independent?


Exercise 4.5.4. Let X and Y be continuous random variables with density

$$f(x, y) = \begin{cases} x e^{-x(y + 1)} & \text{if } x > 0,\ y > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{4.114}$$

Are X and Y independent? 


Exercise 4.5.5. The indicator random variable¹⁷ $I_A$ of an event A in any sample
space S is defined by

$$I_A(s) = \begin{cases} 1 & \text{if } s \in A \\ 0 & \text{if } s \notin A. \end{cases} \tag{4.115}$$


1. Prove that $I_{AB} = I_A I_B$.

2. Prove that $I_{A \cup B} = I_A + I_B - I_{AB}$.

3. Prove that A and B are independent events if and only if $I_A$ and $I_B$ are independent
random variables.


17 In other branches of mathematics, $I_A$ is called the characteristic function of A, but in
probability theory, that name is reserved for a different function.


Exercise 4.5.6. Let the random point (X, Y) be uniformly distributed on the unit disc
as in Example 4.4.7. Show that the polar coordinates $R \in [0, 1]$ and $\Theta \in [0, 2\pi]$ of
the point are independent.


Exercise 4.5.7. Alice and Bob visit the school library, each at a random time uni- 
formly distributed between 2PM and 6PM independently of each other, and stay 
there for an hour. What is the probability that they meet? 


Exercise 4.5.8. A point X is chosen at random on the interval [0, 1] and indepen- 
dently another point Y is chosen on the [1,2] interval. What is the probability that 
we can construct a triangle from the resulting three segments [0, X], [X, Y], [Y, 2] 
as sides? 


Exercise 4.5.9. We choose a point at random on the perimeter of a circle and then, 
independently another point at random in the interior of the circle. What is the prob- 
ability that the two points will be nearer to each other than the radius of the circle? 


Exercise 4.5.10. Let X be a discrete uniform r.v. on the set {000, 011, 101, 110} of
four binary integers, and let $X_i$ denote the ith digit of X, for i = 1, 2, 3. Show that
$X_1, X_2, X_3$ are independent pairwise, but not totally independent.

Can you generalize this example to more than three random variables? 


Exercise 4.5.11. Let X and Y be independent continuous, positive random variables
with given densities $f_X$ and $f_Y$, with $f_X(x) = 0$ for $x \le 0$ and $f_Y(y) = 0$ for $y \le 0$.


1. Find formulas for the distribution function and density function of Z = XY.

2. Find formulas for the distribution function and density function of Z = X/Y.

3. Find the density of Z = XY if X and Y are both uniform on [0, 1].


Exercise 4.5.12. What is the probability that in ten independent tosses of a fair coin 
we get two heads in the first four tosses and five heads altogether? 


Exercise 4.5.13. Consider lightbulbs with independent, exponentially distributed
lifetimes with parameter λ = 1/(100 days).


1. Find the probability that such a bulb survives up to 200 days.
2. Find the probability that such a bulb dies before 40 days.
3. Find the probability that the bulb with the longest lifetime in a batch of 10 survives
to 200 days.

4. Find the probability that the bulb with the shortest lifetime in a batch of 10 dies
before 40 days.


Exercise 4.5.14. Let $X_1, X_2, \ldots, X_n$, for n = 2, 3, ..., be i.i.d. random variables
with common d.f. $F_X$. Find a formula for the joint d.f. $F_{Y,Z}$ of
$Y = \max\{X_1, X_2, \ldots, X_n\}$ and $Z = \min\{X_1, X_2, \ldots, X_n\}$ in terms of $F_X$.


Exercise 4.5.15. Show that the p.d.f. of the sum $S = T_1 + T_2$ of two i.i.d. exponential
r.v.'s with parameter λ is given by

$$f_S(s) = \begin{cases} 0 & \text{if } s < 0 \\ \lambda^2 s e^{-\lambda s} & \text{if } s \ge 0. \end{cases} \tag{4.116}$$


4.6 Conditional Distributions 


In many applications, we need to consider the distribution of a random variable under 
certain conditions. For conditions with nonzero probabilities, we can simply apply 
the definition of conditional probabilities to events associated with random variables. 
Thus, we make the following definition: 


Definition 4.6.1 (Conditional Distributions for Conditions with Nonzero Probabilities).
Let A be any event with $P(A) \ne 0$ and X any random variable. Then we
define the conditional distribution function of X under the condition A by

$$F_{X|A}(x) = P(X \le x \mid A) \quad \text{for all } x \in \mathbb{R}. \tag{4.117}$$

If X is a discrete random variable, then we define the conditional probability
function of X under the condition A by

$$f_{X|A}(x) = P(X = x \mid A) \quad \text{for all } x \in \mathbb{R}. \tag{4.118}$$

If X is a continuous random variable and there exists a nonnegative function $f_{X|A}$
that is integrable over $\mathbb{R}$ and for which

$$\int_{-\infty}^{x} f_{X|A}(t)\,dt = F_{X|A}(x) \quad \text{for all } x, \tag{4.119}$$

then $f_{X|A}$ is called the conditional density function of X under the condition A.
If Y is a discrete random variable and A = {Y = y}, then we write

$$F_{X|Y}(x, y) = P(X \le x \mid Y = y) \quad \text{for all } x \in \mathbb{R} \text{ and all possible values } y \text{ of } Y, \tag{4.120}$$

and call $F_{X|Y}$ the conditional distribution function of X given Y.

If both X and Y are discrete, then the conditional probability function of X given
Y is defined by

$$f_{X|Y}(x, y) = P(X = x \mid Y = y) \quad \text{for all possible values } x \text{ and } y \text{ of } X \text{ and } Y. \tag{4.121}$$

If X is continuous, Y is discrete, A = {Y = y}, and $f_{X|A}$ in Equation 4.119
exists, then $f_{X|A}$ is called the conditional density function of X given Y = y and is
denoted by $f_{X|Y}(x, y)$ for all $x \in \mathbb{R}$ and all possible values y of Y.


If X is a continuous random variable with density function $f_{X|A}$, then, by the
fundamental theorem of calculus, Equation 4.119 gives that

$$F'_{X|A}(x) = f_{X|A}(x), \tag{4.122}$$

wherever $f_{X|A}$ is continuous. At such points, we also have

$$f_{X|A}(x)\,dx \sim P(x \le X < x + dx \mid A) = \frac{P(\{x \le X < x + dx\} \cap A)}{P(A)}. \tag{4.123}$$


By the definitions of conditional probabilities and joint distributions, Equation
4.121, for discrete X and Y, can also be written as

$$f_{X|Y}(x, y) = \frac{f(x, y)}{f_Y(y)} \quad \text{for all possible values } x \text{ and } y \text{ of } X \text{ and } Y, \tag{4.124}$$

where f(x, y) is the joint p.f. of X and Y and $f_Y(y)$ the marginal p.f. of Y.


Example 4.6.1 (Sum and Absolute Difference of Two Dice). In Example 4.4.1 we
considered the random variables U = X + Y and V = |X − Y|, where X and Y
were the numbers obtained by rolling two dice. Now, we want to find the values of
the conditional probability functions $f_{U|V}$ and $f_{V|U}$. For easier reference, we first
reproduce the table of the joint probability function f(u, v) and the marginals (see
Table 4.7).

According to Equation 4.124, Table 4.8 of the conditional probability function
$f_{U|V}(u, v)$ was obtained from Table 4.7 by dividing each f(u, v) value by the
marginal probability below it and, similarly, the table of the conditional probability
function $f_{V|U}(u, v)$ was obtained by dividing each f(u, v) value by the marginal
probability to the right of it (see Table 4.9).

The conditional probabilities in these tables make good sense. For instance, if
V = |X − Y| = 1, then U = X + Y can be only 3 = 1 + 2 = 2 + 1, 5 = 2 + 3 = 3 + 2,
7 = 3 + 4 = 4 + 3, 9 = 4 + 5 = 5 + 4, or 11 = 5 + 6 = 6 + 5. Since
each of these five possible U values can occur under the condition V = 1 in exactly
two ways, their conditional probabilities must be 1/5 each, as shown in the second
column of Table 4.8.

Similarly, if U = X + Y = 3, then we must have (X, Y) = (1, 2) or (X, Y) =
(2, 1) and, in either case, V = |X − Y| = 1. Thus, $f_{V|U}(3, 1) = 1$ as shown for
(u, v) = (3, 1) in Table 4.9. ◊
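
The construction of the conditional tables can be reproduced by enumerating the 36 equally likely outcomes of two dice. The sketch below is an illustration only; the variable names are chosen here, and (u, v) pairs of probability zero are simply absent from the dictionaries.

```python
from fractions import Fraction
from itertools import product

# Joint p.f. of U = X + Y and V = |X - Y| for two fair dice (as in Table 4.7).
joint = {}
for x, y in product(range(1, 7), repeat=2):
    key = (x + y, abs(x - y))
    joint[key] = joint.get(key, 0) + Fraction(1, 36)

# Marginal p.f.'s of U and V.
f_U, f_V = {}, {}
for (u, v), p in joint.items():
    f_U[u] = f_U.get(u, 0) + p
    f_V[v] = f_V.get(v, 0) + p

# Equation 4.124 applied in both directions.
f_U_given_V = {(u, v): p / f_V[v] for (u, v), p in joint.items()}
f_V_given_U = {(u, v): p / f_U[u] for (u, v), p in joint.items()}

print(f_U_given_V[(3, 1)])  # 1/5
print(f_V_given_U[(3, 1)])  # 1
print(f_U_given_V[(7, 5)])  # 1
```

The first two printed values are the ones discussed in the example; the third reflects the fact that V = 5 forces the outcomes (1, 6) or (6, 1) and hence U = 7.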
For a continuous random variable Y, $P(A \mid Y = y)$ and the conditional density
$f_{X|Y}(x, y)$ are undefined since P(Y = y) = 0. Nevertheless we can define $P(A \mid Y = y)$
as a limit with Y falling in an infinitesimal interval at y, rather than being equal to y.
For $f_{X|Y}(x, y)$ we can use Equation 4.124 as a model, with f and $f_Y$ reinterpreted
as densities.


Definition 4.6.2 (Conditional Probabilities and Densities for Given Values of a
Continuous Random Variable). For a continuous random variable Y and any event
A, we define

$$P(A \mid Y = y) = \lim_{h \to 0^+} P(A \mid y \le Y < y + h), \tag{4.125}$$

if the limit exists. In particular, if $A = \{X \le x\}$, for any random variable X and any
real x, then the conditional d.f. of X, given Y = y, is defined as

$$F_{X|Y}(x, y) = \lim_{h \to 0^+} P(X \le x \mid y \le Y < y + h), \tag{4.126}$$

if the limit exists, and, if X is discrete, then the conditional p.f. of X, given Y = y, is
defined as

$$f_{X|Y}(x, y) = \lim_{h \to 0^+} P(X = x \mid y \le Y < y + h), \tag{4.127}$$

if the limit exists.


Table 4.7. The joint and marginal probability functions of U = X + Y and V = |X − Y|, for
the numbers X and Y showing on two dice.


u\v    |  0   |   1   |  2   |  3   |  4   |  5   | f_U(u)
2      | 1/36 |   0   |  0   |  0   |  0   |  0   | 1/36
3      |  0   | 2/36  |  0   |  0   |  0   |  0   | 2/36
4      | 1/36 |   0   | 2/36 |  0   |  0   |  0   | 3/36
5      |  0   | 2/36  |  0   | 2/36 |  0   |  0   | 4/36
6      | 1/36 |   0   | 2/36 |  0   | 2/36 |  0   | 5/36
7      |  0   | 2/36  |  0   | 2/36 |  0   | 2/36 | 6/36
8      | 1/36 |   0   | 2/36 |  0   | 2/36 |  0   | 5/36
9      |  0   | 2/36  |  0   | 2/36 |  0   |  0   | 4/36
10     | 1/36 |   0   | 2/36 |  0   |  0   |  0   | 3/36
11     |  0   | 2/36  |  0   |  0   |  0   |  0   | 2/36
12     | 1/36 |   0   |  0   |  0   |  0   |  0   | 1/36
f_V(v) | 6/36 | 10/36 | 8/36 | 6/36 | 4/36 | 2/36 |  1
Table 4.8. The conditional probability function $f_{U|V}(u, v)$ of U = X + Y given V = |X − Y|,
for the numbers X and Y showing on two dice.


u\v |  0  |  1  |  2  |  3  |  4  | 5
2   | 1/6 |  0  |  0  |  0  |  0  | 0
3   |  0  | 1/5 |  0  |  0  |  0  | 0
4   | 1/6 |  0  | 1/4 |  0  |  0  | 0
5   |  0  | 1/5 |  0  | 1/3 |  0  | 0
6   | 1/6 |  0  | 1/4 |  0  | 1/2 | 0
7   |  0  | 1/5 |  0  | 1/3 |  0  | 1
8   | 1/6 |  0  | 1/4 |  0  | 1/2 | 0
9   |  0  | 1/5 |  0  | 1/3 |  0  | 0
10  | 1/6 |  0  | 1/4 |  0  |  0  | 0
11  |  0  | 1/5 |  0  |  0  |  0  | 0
12  | 1/6 |  0  |  0  |  0  |  0  | 0


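Table 4.9. The conditional probability function $f_{V|U}(u, v)$ of V = |X − Y| given U = X + Y,
for the numbers X and Y showing on two dice.


u\v |  0  |  1  |  2  |  3  |  4  |  5
2   |  1  |  0  |  0  |  0  |  0  |  0
3   |  0  |  1  |  0  |  0  |  0  |  0
4   | 1/3 |  0  | 2/3 |  0  |  0  |  0
5   |  0  | 1/2 |  0  | 1/2 |  0  |  0
6   | 1/5 |  0  | 2/5 |  0  | 2/5 |  0
7   |  0  | 1/3 |  0  | 1/3 |  0  | 1/3
8   | 1/5 |  0  | 2/5 |  0  | 2/5 |  0
9   |  0  | 1/2 |  0  | 1/2 |  0  |  0
10  | 1/3 |  0  | 2/3 |  0  |  0  |  0
11  |  0  |  1  |  0  |  0  |  0  |  0
12  |  1  |  0  |  0  |  0  |  0  |  0


Furthermore, for continuous random variables X and Y with joint density f(x, y),
and Y having marginal density $f_Y(y)$, we define the conditional density $f_{X|Y}$ by

$$f_{X|Y}(x, y) = \begin{cases} \dfrac{f(x, y)}{f_Y(y)} & \text{if } f_Y(y) \ne 0 \\ 0 & \text{otherwise} \end{cases} \tag{4.128}$$

for all real x and y.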


Example 4.6.2 (Conditional Density for (X, Y) Uniform on Unit Disc). Let (X, Y)
be uniform on the unit disc $D = \{(x, y) : x^2 + y^2 < 1\}$ as in Example 4.4.7. Hence

$$f_{X|Y}(x, y) = \frac{f(x, y)}{f_Y(y)} = \begin{cases} \dfrac{1}{2\sqrt{1 - y^2}} & \text{if } (x, y) \in D \\ 0 & \text{otherwise.} \end{cases} \tag{4.129}$$

For a fixed $y \in (-1, 1)$ this expression is constant over the x-interval
$(-\sqrt{1 - y^2}, \sqrt{1 - y^2})$, and therefore, not unexpectedly, it is the density of the
uniform distribution over that interval. ◊


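Note that $f_{X|Y}$ can also be interpreted as a limit. Indeed,

$$\begin{aligned}
\lim_{h \to 0^+} P(x \le X < x + dx \mid y \le Y < y + h)
&= \lim_{h \to 0^+} \frac{P(x \le X < x + dx,\ y \le Y < y + h)}{P(y \le Y < y + h)} \\
&\sim \lim_{h \to 0^+} \frac{f(x, y)\,h\,dx}{f_Y(y)\,h} = \frac{f(x, y)\,dx}{f_Y(y)} = f_{X|Y}(x, y)\,dx,
\end{aligned} \tag{4.130}$$
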

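wherever f(x, y) and $f_Y(y)$ exist and are continuous and $f_Y(y) \ne 0$. Conversely,
$P(A \mid Y = y)$ can also be interpreted without a limit as

$$P(A \mid Y = y) = \frac{P(A)\,f_{Y|A}(y)}{f_Y(y)} \tag{4.131}$$

wherever $f_{Y|A}(y)$ and $f_Y(y)$ exist and are continuous and $f_Y(y) \ne 0$, because then

$$\begin{aligned}
\lim_{h \to 0^+} P(A \mid y \le Y < y + h) &= \lim_{h \to 0^+} \frac{P(A \cap \{y \le Y < y + h\})}{P(y \le Y < y + h)} \\
&= \lim_{h \to 0^+} \frac{P(A)\,P(y \le Y < y + h \mid A)}{P(y \le Y < y + h)} \\
&= \lim_{h \to 0^+} \frac{P(A)\,f_{Y|A}(y)\,h}{f_Y(y)\,h} = \frac{P(A)\,f_{Y|A}(y)}{f_Y(y)}.
\end{aligned} \tag{4.132}$$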

Equation 4.131 is valid also when $f_{Y|A}(y)$ and $f_Y(y)$ exist and are continuous
and $f_Y(y) \ne 0$, but P(A) = 0, since in this case $A \cap \{y \le Y < y + h\} \subset A$, and
so $P(A \cap \{y \le Y < y + h\}) = 0$, which implies $P(A \mid y \le Y < y + h) = 0$ and
$P(A \mid Y = y) = 0$ as well. Thus, Equation 4.131 reduces to 0 = 0.

Equation 4.131 can be written in multiplicative form as

$$P(A \mid Y = y)\,f_Y(y) = P(A)\,f_{Y|A}(y). \tag{4.133}$$

This equation is valid also when $f_Y(y) = 0$, because in that case $f_{Y|A}(y) = 0$ as
well. This fact follows from Equation 4.123 with Y in place of X:

$$f_{Y|A}(y)\,dy \sim \frac{P(\{y \le Y < y + dy\} \cap A)}{P(A)} \le \frac{P(y \le Y < y + dy)}{P(A)} \sim \frac{f_Y(y)\,dy}{P(A)} = 0. \tag{4.134}$$

Similarly, Equation 4.128 too can be written in multiplicative form as

$$f_{X|Y}(x, y)\,f_Y(y) = f(x, y). \tag{4.135}$$

This equation is valid when $f_Y(y) = 0$, as well, since $f_Y(y) = 0$ implies f(x, y) = 0.
Interchanging x and y, we also have

$$f_{Y|X}(y, x)\,f_X(x) = f(x, y). \tag{4.136}$$


Returning to $f_{X|Y}$, we can see that, for any fixed y such that $f_Y(y) \ne 0$, it
is a density as a function of x. Consequently, it can be used to define conditional
probabilities for X, given Y = y, as

$$P(a < X < b \mid Y = y) = \int_a^b f_{X|Y}(x, y)\,dx = \frac{1}{f_Y(y)} \int_a^b f(x, y)\,dx \tag{4.137}$$

and, in particular, the conditional distribution function of X, given Y = y, as

$$F_{X|Y}(x, y) = \int_{-\infty}^{x} f_{X|Y}(t, y)\,dt = \frac{1}{f_Y(y)} \int_{-\infty}^{x} f(t, y)\,dt. \tag{4.138}$$


Using Definition 4.6.2, we can generalize the theorem of total probability (The- 
orem 3.5.2) as follows: 


Theorem 4.6.1 (Theorem of Total Probability, Continuous Versions). For a continuous
random variable Y and any event A, if $f_{Y|A}$ and $f_Y$ exist for all y, then

$$P(A) = \int_{-\infty}^{\infty} P(A \mid Y = y)\,f_Y(y)\,dy \tag{4.139}$$

and, if X and Y are jointly continuous and $f_{X|Y}$ and $f_Y$ exist for all x, y, then

$$f_X(x) = \int_{-\infty}^{\infty} f_{X|Y}(x, y)\,f_Y(y)\,dy. \tag{4.140}$$


Proof. Integrating both sides of Equation 4.133 from $-\infty$ to $\infty$, we obtain Equation
4.139, since $\int_{-\infty}^{\infty} f_{Y|A}(y)\,dy = 1$ from Equation 4.119.

Similarly, integrating both sides of Equation 4.135 with respect to y from $-\infty$
to $\infty$, we obtain Equation 4.140. ∎
We have new versions of Bayes' theorem as well:


Theorem 4.6.2 (Bayes' Theorem, Continuous Versions). For a continuous random
variable Y and any event A with nonzero probability, if $P(A \mid Y = y)$ and $f_Y$ exist for
all y, then

$$f_{Y|A}(y) = \frac{P(A \mid Y = y)\,f_Y(y)}{\int_{-\infty}^{\infty} P(A \mid Y = y)\,f_Y(y)\,dy}. \tag{4.141}$$

Here $f_Y$ is called the prior density of Y, and $f_{Y|A}$ its posterior density, referring to
the fact that these are the densities of Y before and after the observation of A.
Furthermore, if X and Y are both continuous, $f_{X|Y}$ and $f_Y$ exist for all x, y, and
$f_X(x) \ne 0$, then

$$f_{Y|X}(y, x) = \frac{f_{X|Y}(x, y)\,f_Y(y)}{\int_{-\infty}^{\infty} f_{X|Y}(x, y)\,f_Y(y)\,dy}. \tag{4.142}$$

Again, $f_Y$ is called the prior density of Y, and $f_{Y|X}$ its posterior density.


Proof. From Equation 4.133 we get, when $P(A) \ne 0$,

$$f_{Y|A}(y) = \frac{P(A \mid Y = y)\,f_Y(y)}{P(A)}. \tag{4.143}$$

Substituting the expression for P(A) here from Equation 4.139, we obtain Equation
4.141.
Similarly, from Equations 4.135 and 4.136 we obtain, when $f_X(x) \ne 0$,

$$f_{Y|X}(y, x) = \frac{f_{X|Y}(x, y)\,f_Y(y)}{f_X(x)}, \tag{4.144}$$

and substituting the expression for $f_X(x)$ here from Equation 4.140, we obtain Equation
4.142. ∎


Example 4.6.3 (Bayes Estimate of a Bernoulli Parameter). Suppose that X is a
Bernoulli random variable with an unknown parameter P that is uniformly distributed
on the interval [0, 1]. In other words, let¹⁸

$$f_{X|P}(x, p) = p^x (1 - p)^{1 - x} \quad \text{for } x = 0, 1 \tag{4.145}$$

and

$$f_P(p) = \begin{cases} 1 & \text{for } p \in [0, 1] \\ 0 & \text{otherwise.} \end{cases} \tag{4.146}$$

We make an observation of X and want to find the posterior density $f_{P|X}(p, x)$ of
P. (This problem is a very simple example of the so-called Bayesian method of
statistical estimation. It will be generalized to several observations instead of just
one in Example 6.4.4.)
By Equation 4.142,

$$f_{P|X}(p, x) = \begin{cases} \dfrac{p^x (1 - p)^{1 - x}}{\int_0^1 p^x (1 - p)^{1 - x}\,dp} & \text{for } p \in [0, 1] \text{ and } x = 0, 1 \\ 0 & \text{otherwise.} \end{cases} \tag{4.147}$$

For x = 1 we have $\int_0^1 p^x (1 - p)^{1 - x}\,dp = \int_0^1 p\,dp = 1/2$, and for x = 0, similarly,
$\int_0^1 p^x (1 - p)^{1 - x}\,dp = \int_0^1 (1 - p)\,dp = 1/2$. Hence,

$$f_{P|X}(p, x) = \begin{cases} 2p & \text{for } p \in [0, 1] \text{ and } x = 1 \\ 2(1 - p) & \text{for } p \in [0, 1] \text{ and } x = 0 \\ 0 & \text{otherwise.} \end{cases} \tag{4.148}$$

Thus, the observation changes the uniform prior density into a triangular posterior
density that gives more weight to p-values near the observed value of X.


18 We assume $0^0 = 1$ where necessary.
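
The posterior density of this example can be reproduced numerically by evaluating the Bayes formula with the integral in the denominator approximated by a midpoint sum. This is a sketch for illustration only; the function name `posterior_density` and the step count are choices made here. The uniform prior equals 1 on [0, 1] and therefore drops out of both numerator and denominator.

```python
def posterior_density(p, x, steps=10_000):
    """Posterior density of P at p after observing X = x, for a uniform prior on [0, 1]."""
    if not 0.0 <= p <= 1.0 or x not in (0, 1):
        return 0.0

    def likelihood(q):
        # Bernoulli p.f. q^x (1 - q)^(1 - x); Python evaluates 0.0 ** 0 as 1.0.
        return q**x * (1 - q)**(1 - x)

    h = 1.0 / steps
    # Midpoint-rule approximation of the integral of the likelihood over [0, 1].
    evidence = h * sum(likelihood((i + 0.5) * h) for i in range(steps))
    return likelihood(p) / evidence

for p in (0.1, 0.5, 0.9):
    print(p, posterior_density(p, 1), posterior_density(p, 0))
```

The printed values should agree, up to rounding, with 2p for x = 1 and 2(1 − p) for x = 0.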


Before closing this section, we want to present one more theorem, which follows 
from the definitions at once: 


Theorem 4.6.3 (Conditions for Independence of Random Variables). If A is any
event with $P(A) \ne 0$ and X any random variable, then A and X are independent of
each other if and only if

$$F_{X|A}(x) = F_X(x) \quad \text{for all } x \in \mathbb{R}. \tag{4.149}$$

If X and Y are any random variables, then they are independent of each other if
and only if

$$F_{X|Y}(x, y) = F_X(x) \tag{4.150}$$

for all $x \in \mathbb{R}$ and, for discrete Y, at all possible values y of Y and, for continuous Y,
at all y values where $F_{X|Y}(x, y)$ exists.

If A is any event with $P(A) \ne 0$ and X any discrete random variable, then A and
X are independent of each other if and only if

$$f_{X|A}(x) = f_X(x) \quad \text{for all } x \in \mathbb{R}. \tag{4.151}$$

If X and Y are any random variables, both discrete or both absolutely continuous,
then they are independent of each other if and only if

$$f_{X|Y}(x, y) = f_X(x) \tag{4.152}$$

for all $x \in \mathbb{R}$ and all y values where $f_Y(y) \ne 0$.


In closing this section, let us mention that all the conditional functions considered 
above can easily be generalized to more than two random variables, as will be seen 
in some exercises and later chapters. 
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Exercises 


Exercise 4.6.1. Roll four dice. Let X denote the number of 1’s and Y the number of 
6’s obtained. Find the values of the p.f. fy|y (x, y) and display them in a5 x 5 table. 


Exercise 4.6.2. Roll two dice. Let X and Y denote the numbers obtained and let 
Z=X+Y. 


1. Find the values of the p.f. fy|z(x, z) and display them in a 6 x 11 table. 

2. Find the values of the conditional joint p.f. f(y,y)\z(*, y, z) for z = 2 and show 
that X and Y are independent under this condition. 

3. Find the values of the conditional joint p-f. f(x,y)\z(, y, z) for z = 3 and show 
that X and Y are not independent under this condition. 


Exercise 4.6.3. As in Example 4.5.4, pick two random points X and Y independently 
and uniformly on the interval [0, 1] and let A denote the event that we can construct a 
triangle from the resulting three segments as its sides. Find the probability P(A|X = 
x) as a function of x and the conditional density function fy|4(x). 


Exercise 4.6.4. As in Example 4.6.3 let X be a Bernoulli random variable with an
unknown parameter P, which is uniformly distributed on the interval (0, 1). Suppose
we make two independent observations \(X_1\) and \(X_2\) of X, so that

\[ f_{(X_1,X_2)|P}(x_1, x_2, p) = p^{x_1+x_2}(1-p)^{2-x_1-x_2} \quad \text{for } x_1, x_2 = 0, 1. \tag{4.153} \]

Find and graph \(f_{P|(X_1,X_2)}(p, x_1, x_2)\) for all four possible values of \((x_1, x_2)\).


Exercise 4.6.5. Let (X, Y) be uniform on the triangle D = {(x, y) : 0 < x, 0 < y,
x + y < 1}. Find the conditional densities \(f_{X|Y}(x, y)\) and \(f_{Y|X}(x, y)\).


Exercise 4.6.6. Let D = {(x, y) : 0 < x, 0 < y, x + y < 1} and (X, Y) have density

\[ f(x, y) = \begin{cases} 60xy^2 & \text{if } (x, y) \in D \\ 0 & \text{otherwise.} \end{cases} \tag{4.154} \]

(See Example 4.4.8.) Find the conditional densities \(f_{X|Y}(x, y)\) and \(f_{Y|X}(x, y)\).


Exercise 4.6.7. Let (X, Y) be uniform on the open unit square D = {(x, y) : 0 <
x < 1, 0 < y < 1} and Z = X + Y. (See Example 4.4.9.)


1. Find the conditional distribution functions \(F_{X|Z}(x, z)\) and \(F_{Y|Z}(y, z)\) and the
conditional densities \(f_{X|Z}(x, z)\) and \(f_{Y|Z}(y, z)\).
2. Let A be the event {Z < 1}. Find \(F_{X|A}(x)\) and \(f_{X|A}(x)\).


5 


Expectation, Variance, Moments 


5.1 Expected Value 


Just as probabilities are idealized relative frequencies, so are expected values anal- 
ogous idealizations of averages of random variables. Before presenting the formal 
definition, let us consider an example. 


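Example 5.1.1 (Average of Dice Rolls). Suppose that we roll a die n = 18 times, and
observe the following outcomes: 2, 4, 2, 1, 5, 5, 4, 3, 4, 2, 6, 6, 3, 4, 1, 2, 5, 6. The
average of these numbers can be computed as

\[
\begin{aligned}
\text{average} &= \frac{2+4+2+1+5+5+4+3+4+2+6+6+3+4+1+2+5+6}{18} \\
&= \frac{2}{18}\cdot 1 + \frac{4}{18}\cdot 2 + \frac{2}{18}\cdot 3 + \frac{4}{18}\cdot 4 + \frac{3}{18}\cdot 5 + \frac{3}{18}\cdot 6 \\
&= \sum_{i=1}^{6} f_i\, i = \frac{65}{18},
\end{aligned}
\tag{5.1}
\]

where \(f_i\) stands for the relative frequency of the outcome i.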
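
The two ways of writing the average in Equation 5.1 can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are ours and exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction

# The 18 outcomes listed in Example 5.1.1.
rolls = [2, 4, 2, 1, 5, 5, 4, 3, 4, 2, 6, 6, 3, 4, 1, 2, 5, 6]
n = len(rolls)

# Direct average: sum of the outcomes divided by the number of rolls.
direct_average = Fraction(sum(rolls), n)

# Relative frequency f_i of each outcome i.
counts = Counter(rolls)
rel_freq = {i: Fraction(counts[i], n) for i in range(1, 7)}

# Weighted form of the same average: sum of f_i * i.
weighted_average = sum(rel_freq[i] * i for i in range(1, 7))

print(direct_average, weighted_average, float(direct_average))
```

Both expressions give the same number because grouping equal outcomes only reorders the terms of the sum.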

Now ideally, since for a fair die the six outcomes are equally likely, we should
have obtained each number 3 times, but that is not what usually happens. For large
n, however, the relative frequencies are approximately equal to the corresponding
probabilities \(p_i = 1/6\) and the average becomes close to

\[ \sum_{i=1}^{6} p_i\, i = \sum_{i=1}^{6} \frac{1}{6}\, i = 3.5. \tag{5.2} \]

We use the first sum in Equation 5.2 as the paradigm for our general idealized
average:

Definition 5.1.1 (Expected Value). For any discrete random variable X, writing
\(p_i = P(X = x_i)\), we define the expected value, mean or expectation of X as

\[ E(X) = \sum_i p_i x_i, \tag{5.3} \]

provided, in case of an infinite sum, that the sum is absolutely convergent.¹ The
summation runs over all i for which \(x_i\) is a possible value of X.

For any continuous random variable X with density f(x) we define the expected
value or expectation of X as

\[ E(X) = \int_{-\infty}^{\infty} x f(x)\,dx, \tag{5.4} \]

provided that the improper integral is absolutely convergent.
Remarks. 


1. We did not give the definition for general random variables; that is a topic taken 
up in graduate courses. We shall assume, without further mention, that the ran- 
dom variables we discuss are either discrete or absolutely continuous. 

2. Because of the occurrence of infinite sums and integrals, E(X) does not exist 
for some random variables, as will be illustrated shortly. These cases are rare, 
however, in real-life applications. 

3. The expected value of a random variable X is not necessarily a possible value
of X, despite its name; see, for instance, Example 5.1.1, but in many cases it can 
be used to predict, before the experiment is performed, that a value of X close to 
E(X) can be expected. 

4. The expected value of a random variable X depends only on the distribution of 
X and not on any other properties of X. Thus, if two different random variables 
have the same distribution, then they have the same expectation as well. For 
instance, if X is the number of H’s in n tosses of a fair coin and Y is the number 
of T’s, then E(X) = E(Y). 

5. In the discrete case E(X) can also be written as

\[ E(X) = \sum_{x :\, f(x) > 0} x f(x), \tag{5.5} \]

where f is the p.f. of X.
6. E(X) is often abbreviated as \(\mu\) or \(\mu_X\).


Example 5.1.2 (Bernoulli Random Variable). Recall that X is a Bernoulli random
variable with parameter p (see Definition 4.1.4), if it has two possible values: 1 and
0, and P(X = 1) = p and P(X = 0) = q = 1 − p.

Hence, E(X) = 1 · p + 0 · q = p.


¹ Requiring absolute convergence is necessary, because if the sum were merely conditionally
convergent, then the value of E(X) would depend on the order of the terms. Similarly, in
the continuous case, if the integral were merely conditionally convergent, then E(X) would
depend on the manner in which the limits of the integral tend to ∞.

Example 5.1.3 (Expected Value of Uniform Random Variables). Let X be uniform
over an interval [a, b], that is, let it have p.d.f.

\[ f(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } a < x < b \\ 0 & \text{if } x \le a \text{ or } x \ge b. \end{cases} \tag{5.6} \]

Then its expected value is given by

\[ E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_a^b \frac{x}{b-a}\,dx = \frac{1}{b-a} \left. \frac{x^2}{2} \right|_a^b = \frac{a+b}{2}. \tag{5.7} \]


Example 5.1.4 (Expected Value of Exponential Random Variables). Let T be an
exponential r.v. with parameter \(\lambda\). (See Definition 4.2.3.) Then its p.d.f. is \(f(t) = \lambda e^{-\lambda t}\)
for t > 0, and so

\[ E(T) = \int_{-\infty}^{\infty} t f(t)\,dt = \int_0^{\infty} t \lambda e^{-\lambda t}\,dt. \tag{5.8} \]

Integrating by parts with \(u = t\) and \(dv = \lambda e^{-\lambda t}\,dt\), we get

\[ E(T) = \left. -t e^{-\lambda t} \right|_0^{\infty} + \int_0^{\infty} e^{-\lambda t}\,dt = 0 - \left. \frac{e^{-\lambda t}}{\lambda} \right|_0^{\infty} = \frac{1}{\lambda}. \tag{5.9} \]
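
The value E(T) = 1/λ obtained in Example 5.1.4 can be checked numerically. The sketch below approximates the integral in Equation 5.8 by a midpoint rule; the function name, the step count and the truncation point of the infinite range are our own assumptions, not part of the text.

```python
import math

def expected_exponential(lam, upper=None, steps=200_000):
    """Midpoint-rule approximation of the integral of t * lam * exp(-lam * t) over [0, upper]."""
    if upper is None:
        # Assumption: the tail beyond 50 mean lifetimes is negligible (about exp(-50)).
        upper = 50.0 / lam
    h = upper / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h          # midpoint of the k-th subinterval
        total += t * lam * math.exp(-lam * t)
    return total * h

for lam in (0.5, 1.0, 2.0):
    print(lam, expected_exponential(lam), 1.0 / lam)
```

For each λ the two printed values should agree to several decimal places.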


In Examples 5.1.3 and 5.1.1, E(X) was at the center of the distribution. This 
property of E(X) is true in general, as explained in the following observation and 
subsequent theorem. 

The expected value is a measure of the center of a probability distribution, be- 
cause the defining formulas are exactly the same as the corresponding ones for the 
center of mass in mechanics for masses on the x-axis (or, more generally, for the 
x-coordinates of masses in space), with \(p_i\) as the mass of a point at \(x_i\) and f(x) as
the mass density for a smeared out mass distribution. Thus, if we were to cut out the
graph of the p.f. or p.d.f. of a r.v. X from cardboard, then it would be balanced if supported
under the point x = E(X). In a similar vein, the following theorem confirms
that E(X) yields the obvious center for a symmetric distribution. 


Theorem 5.1.1 (The Center of Symmetry Equals E(X)). If the distribution of a
random variable is symmetric about a point a, that is, the p.f. or the p.d.f. satisfies
f(a − x) = f(a + x) for all x, and E(X) exists, then E(X) = a.


Proof. We give the proof for continuous X only; for discrete X the proof is similar
and is left as an exercise.
We can write

\[ E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{-\infty}^{\infty} (x - a + a) f(x)\,dx = \int_{-\infty}^{\infty} (x - a) f(x)\,dx + \int_{-\infty}^{\infty} a f(x)\,dx, \tag{5.10} \]

where the first integral on the right-hand side will be shown to be 0, and so

\[ E(X) = \int_{-\infty}^{\infty} a f(x)\,dx = a \int_{-\infty}^{\infty} f(x)\,dx = a. \tag{5.11} \]

The integral of (x − a) f(x) may be evaluated as follows:

\[ \int_{-\infty}^{\infty} (x - a) f(x)\,dx = \int_{-\infty}^{a} (x - a) f(x)\,dx + \int_{a}^{\infty} (x - a) f(x)\,dx, \tag{5.12} \]

where in the first integral on the right-hand side we substitute u = a − x and in the
second integral u = x − a. Hence

\[ \int_{-\infty}^{\infty} (x - a) f(x)\,dx = \int_{\infty}^{0} u f(a - u)\,du + \int_{0}^{\infty} u f(a + u)\,du = \int_{0}^{\infty} u\,[f(a + u) - f(a - u)]\,du = 0, \tag{5.13} \]

where the last step follows from the symmetry assumption. □


If a random variable is bounded from below, say by 0, and we know its expected 
value, then only a small fraction of its values can fall far out on the right, that is, the 
expected value yields a bound for the right tail of the distribution: 


Theorem 5.1.2 (Markov's Inequality). If X is a nonnegative random variable with
expected value \(\mu\) and a is any positive number, then

\[ P(X \ge a) \le \frac{\mu}{a}. \tag{5.14} \]


Proof. We prove the statement only for continuous X with density f. Then

\[ \mu = \int_0^{\infty} x f(x)\,dx = \int_0^{a} x f(x)\,dx + \int_a^{\infty} x f(x)\,dx \ge \int_a^{\infty} x f(x)\,dx \ge a \int_a^{\infty} f(x)\,dx = a P(X \ge a), \tag{5.15} \]

from which Equation 5.14 follows at once. □
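
To see how loose or tight Markov's inequality can be, one may compare the bound with an exactly known tail. The sketch below does this for an exponential r.v. with parameter λ, for which the tail probability is \(e^{-\lambda a}\) and the mean is 1/λ (Example 5.1.4); the choice of this distribution and the function name are ours.

```python
import math

def markov_check(lam, a):
    """Return (exact tail P(X >= a), Markov bound mu / a) for an exponential r.v. with parameter lam."""
    mu = 1.0 / lam                 # E(X) for the exponential distribution
    tail = math.exp(-lam * a)      # P(X >= a) = exp(-lam * a)
    return tail, mu / a

results = {a: markov_check(1.0, a) for a in (0.5, 1.0, 2.0, 5.0)}
for a, (tail, bound) in results.items():
    print(a, round(tail, 4), round(bound, 4), tail <= bound)
```

The bound always holds, but for small a it exceeds 1 and carries no information, while for large a the true tail is much smaller than the bound.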


The main use of Theorem 5.1.2 is in proving another inequality, for not neces- 
sarily positive random variables, in the next section, which, in turn, will be used for 
a proof of the so-called law of large numbers. 

In addition to providing a measure of the center of a probability distribution, 
the expected value has many other uses, as will be discussed later. For now, we just 
describe its occurrence in gambling games.

Example 5.1.5 (Total Gain in Dice Rolls). Consider the same game as in Example
5.1.1 with the same outcomes and assume that whenever the die shows the number
i, we win i dollars. In that case our total gain will be $65, which can be written as
18 × average. Similarly, in the ideal situation the total gain would be 18 × 3.5 = $63.

Thus, in general games, our ideal gain is nE(X), where n is the number of times
we play. (Mathematically, this result follows from Theorem 5.1.5 below.) Hence,
E(X) is a measure of the fairness of a game, and a game is called fair if E(X) = 0.

The dice game described above is very unfair, and we may ask how much we
should be required to bet each time to make the game fair. Clearly, the
answer is $3.50, that is, if we lose this bet each time and win i dollars with probability
1/6 for i = 1, 2, ..., 6, then

\[ E(X - 3.50) = \sum_{i=1}^{6} \frac{1}{6}\,(i - 3.50) = 0, \tag{5.16} \]

and the game is turned into a fair one.

In general, if we have an unfair game with E(X) > 0, then paying an entrance
fee of E(X) dollars each time will turn the game into a fair one. (This follows from
Corollary 5.1.1 below.)


Example 5.1.6 (Roulette). In Nevada roulette, a wheel with 38 numbered pockets is 
spun around and a ball is rolled around its rim in the opposite direction until it falls 
at random into one of the pockets. On a table, the numbers of the pockets are laid 
out and the players can bet on various combinations to come up, with predetermined 
payouts. 18 of the numbers are black and 18 are red, while two are green. One of the 
possible betting combinations is that of betting on red with a $1 payout for every $1 
bet (that is, if red comes up, you keep your bet and get another dollar, and if black or 
green comes up, you lose your bet). Compute the expected gain from such a bet. 

If we denote the amount won or lost in a single play of $1 by X, then P(X = 1) = 18/38
and P(X = −1) = 20/38. Thus,

\[ E(X) = \frac{18}{38}\cdot 1 + \frac{20}{38}\cdot(-1) \approx -0.0526 \text{ dollars} = -5.26 \text{ cents}. \tag{5.17} \]

This result means that in the long run the players will lose about 5.26 cents on every
dollar bet.

The house advantage is set up to be about 5% for the other possible betting com- 
binations as well. 
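
The computation in Example 5.1.6 is an instance of Definition 5.1.1 for a two-valued random variable. A small sketch with exact fractions follows; the helper name `expected_gain` and the list-of-pairs representation are our own illustrative choices.

```python
from fractions import Fraction

def expected_gain(outcomes):
    """outcomes: list of (gain, probability) pairs whose probabilities sum to 1."""
    if sum(p for _, p in outcomes) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(g * p for g, p in outcomes)

# Betting $1 on red: win $1 with probability 18/38, lose $1 with probability 20/38.
bet_on_red = [(1, Fraction(18, 38)), (-1, Fraction(20, 38))]
e_red = expected_gain(bet_on_red)
print(e_red, float(e_red))
```

The exact value is −1/19 of a dollar, which is the −5.26 cents quoted in the example.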


Example 5.1.7 (The Saint Petersburg Paradox). Some gamblers in Saint Petersburg, 
Russia in the 18th century devised a betting scheme for even money bets, such as 
betting on red in roulette. First you bet 1 unit and if you win you quit. If you lose,
you bet 2 units on the next game. If you win this game, then you are ahead by 1 unit,
because you have lost 1 and won 2, and you quit. If you lose again, then you bet 4 on
the third game. If you win this time, then you are again ahead by 1, since you have
lost 1 + 2, but have won 4. If you lose, you bet 8, and so on. Thus, the claim was
that, following this scheme, you are assured of winning 1 unit.

The expected gain is also 1 unit: If X denotes the net gain and n the number of
plays till you win and stop, then, according to the above discussion, X = 1 for any
n. If p < 1 denotes the probability of winning in any trial and q = 1 − p is the
probability of losing, then P(first win occurs on the nth play) = \(q^{n-1}p\). Hence, by
the sum formula for a geometric series,

\[ E(X) = \sum_{n=1}^{\infty} q^{n-1} p \cdot 1 = \frac{p}{1-q} = 1. \tag{5.18} \]
n=1 


On the other hand, roulette is an unfavorable game, so how can it be possible to 
beat it? The answer is simple: it cannot be beaten. In this game you need an infinite 
amount of money to be assured of winning, since it is quite possible that you may 
need to bet \(2^n\) units, with n arbitrarily large.

If the bet size is capped, however, either by the house or by the player's capital,
then the scheme has no advantage over any other scheme. Indeed, if the maximum
bet size is \(2^N\), then

\[ E(X) = \sum_{n=1}^{N} q^{n-1} p \cdot 1 - \sum_{n=N+1}^{\infty} q^{n-1} p\,(2^N - 1) = \frac{p\,(1 - q^N)}{1 - q} - \frac{p\,q^N (2^N - 1)}{1 - q} = 1 - (2q)^N. \tag{5.19} \]

This result is exactly what we would expect, since 1 is the expected value for an
overall win and \(2^N\) is the last bet in the case of a string of losses, which has
probability \(q^N\), and so \(2^N q^N\) is the expected loss. Notice that in the case of a fair game,
q = 1/2 and \(E(X) = 1 - 2^N (1/2)^N = 0\), that is, a fair game remains fair under this
doubling scheme as well.
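
The closed form 1 − (2q)^N for the capped doubling scheme can be checked by adding up the outcomes directly. The sketch below assumes that play stops after the first win or after N straight losses, with bets 1, 2, 4, ..., so that N losses cost \(2^N - 1\) units in total; this stopping rule and the function name are our reading of the scheme, stated here as an assumption.

```python
from fractions import Fraction

def capped_doubling_expectation(q, N):
    """Expected net gain when we stop after the first win or after N straight losses.

    Assumed bets: 1, 2, 4, ..., 2**(N-1). A win on any play leaves us 1 unit ahead;
    N losses in a row cost 1 + 2 + ... + 2**(N-1) = 2**N - 1 units.
    """
    p = 1 - q
    win_part = sum(q ** (n - 1) * p * 1 for n in range(1, N + 1))
    loss_part = q ** N * (2 ** N - 1)
    return win_part - loss_part

q_red = Fraction(20, 38)   # probability of losing a bet on red in Nevada roulette
for N in (1, 5, 10):
    exact = capped_doubling_expectation(q_red, N)
    print(N, float(exact), float(1 - (2 * q_red) ** N))
```

For q = 1/2 the function returns exactly 0 for every N, in agreement with the remark that a fair game stays fair; for roulette the expectation becomes more negative as N grows.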

Another variant of the Saint Petersburg scheme provides an example of a random
variable with infinite expectation. For the sake of simplicity we assume that we are
betting on H in independent tosses of a fair coin. Again, we play until the first H
comes up, but this time we bet even more: \((n + 1)2^{n-1}\) units on the nth toss if the
first n − 1 tosses resulted in T, for n = 1, 2, ... . If the first H occurs on the nth toss,
which has probability \(1/2^n\), then the gain is (see Exercise 5.1.5)

\[ (n + 1)2^{n-1} - \sum_{i=1}^{n-1} (i + 1)2^{i-1} = 2^n \tag{5.20} \]
and so
\[ E(X) = \sum_{n=1}^{\infty} 2^n \cdot \frac{1}{2^n} = \sum_{n=1}^{\infty} 1 = \infty. \tag{5.21} \]
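
The gain formula of this variant is easy to test for small n. The sketch below assumes, as we read the text, that the bet on the ith toss is \((i + 1)2^{i-1}\) units and that a winning even-money bet returns its own size; it is a numerical check, not a proof of Equation 5.20.

```python
def bet(i):
    """Assumed bet size on the i-th toss: (i + 1) * 2**(i - 1) units."""
    return (i + 1) * 2 ** (i - 1)

def gain_if_first_head_on(n):
    """Net gain when the first n - 1 tosses are T and the n-th toss is H."""
    lost = sum(bet(i) for i in range(1, n))
    return bet(n) - lost

gains = [gain_if_first_head_on(n) for n in range(1, 11)]
print(gains)

# Each term gain * probability equals 2**n / 2**n = 1, so the partial sums
# of the series for E(X) grow without bound.
partial_sums = [sum(gain_if_first_head_on(n) / 2 ** n for n in range(1, m + 1))
                for m in (10, 100)]
print(partial_sums)
```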
Next, we present a surprising, but very useful, theorem, which enables us to compute
the expectation of a function Y = g(X) of a r.v. X without going through the
laborious process of first finding the distribution of Y.


Theorem 5.1.3 (Expectation of a Function of a Random Variable). Let X be any
random variable, and define a new random variable as Y = g(X), where g is any²
function on the range of X.

If X is discrete, then writing \(p_i = P(X = x_i)\), we have

\[ E(Y) = \sum_i p_i\, g(x_i), \tag{5.22} \]

provided that the sum is absolutely convergent. (The summation runs over all i for
which \(x_i\) is a possible value of X.)
If X is continuous with p.d.f. \(f_X\), then

\[ E(Y) = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx, \tag{5.23} \]

provided the integral is absolutely convergent.


Before giving the proof, let us compare the evaluation of E(Y) by the theorem 
with its evaluation from the definition, on a simple example. 


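Example 5.1.8 (Expectation of g(X) = |X| for a Discrete X). Let the p.f. of X be
given by

\[ f_X(x) = \begin{cases} 1/8 & \text{if } x = -1 \\ 3/8 & \text{if } x = 0 \\ 3/8 & \text{if } x = 1 \\ 1/8 & \text{if } x = 2 \end{cases} \tag{5.24} \]

and let Y = |X|. Then

\[ f_Y(y) = \begin{cases} 3/8 & \text{if } y = 0 \\ 1/2 & \text{if } y = 1 \\ 1/8 & \text{if } y = 2. \end{cases} \tag{5.25} \]

Hence, by Definition 5.1.1,

\[ E(Y) = \frac{3}{8}\cdot 0 + \frac{1}{2}\cdot 1 + \frac{1}{8}\cdot 2 = \frac{3}{4}. \tag{5.26} \]

On the other hand, by Theorem 5.1.3,

\[ E(Y) = \frac{1}{8}\,|-1| + \frac{3}{8}\,|0| + \frac{3}{8}\,|1| + \frac{1}{8}\,|2| = \frac{3}{4}. \tag{5.27} \]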


² Actually, g must be a so-called measurable function. This restriction is discussed in more
advanced texts; all functions encountered in elementary calculus courses are of this type.

Thus, we see that the difference between the two evaluations is that the two terms
in Equation 5.27 that contain |−1| and |1| are combined into a single term in Equation
5.26. This is the sort of thing that happens in the general discrete case as well. In the
evaluation of \(f_Y\) we combine the probabilities of various x-values (see the proof
below), which we treat separately when using the theorem. In more complicated
cases, it can be difficult to find the x-values that need to be combined, but treating
them separately is very straightforward.
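
The two routes compared in Example 5.1.8 translate directly into code. The following sketch uses the p.f. of that example with Y = |X|; the dictionary representation and the variable names are illustrative assumptions.

```python
from collections import defaultdict
from fractions import Fraction

# p.f. of X from Example 5.1.8.
f_X = {-1: Fraction(1, 8), 0: Fraction(3, 8), 1: Fraction(3, 8), 2: Fraction(1, 8)}
g = abs

# Route 1 (Definition 5.1.1): first build the p.f. of Y = g(X) by combining
# the probabilities of all x with the same g(x), then average over y.
f_Y = defaultdict(Fraction)
for x, p in f_X.items():
    f_Y[g(x)] += p
e_by_definition = sum(y * p for y, p in f_Y.items())

# Route 2 (Theorem 5.1.3): average g(x) directly against the p.f. of X.
e_by_theorem = sum(g(x) * p for x, p in f_X.items())

print(dict(f_Y), e_by_definition, e_by_theorem)
```

Route 2 never needs the dictionary `f_Y`, which is the practical advantage of the theorem.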


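Proof (of Theorem 5.1.3). In the discrete case, we evaluate \(\sum_i p_i\, g(x_i)\) in two stages:
first, we sum over all \(x_i\) for which \(g(x_i)\) is a fixed value \(y_k\) of Y, and then sum over
all k for which \(y_k\) is a possible value of Y. Thus, assuming absolute convergence,

\[
\begin{aligned}
\sum_i p_i\, g(x_i) &= \sum_k \sum_{i :\, g(x_i) = y_k} p_i\, g(x_i) = \sum_k \Bigl( \sum_{i :\, g(x_i) = y_k} p_i \Bigr) y_k \\
&= \sum_k P(Y = y_k)\, y_k = \sum_k f_Y(y_k)\, y_k = E(Y).
\end{aligned}
\tag{5.28}
\]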
k k 


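For continuous X, the general proof is beyond the scope of this book and is
therefore omitted.³ However, if g is one-to-one and differentiable, then the proof is
simple, and goes like this:

By Definition 5.1.1,

\[ E(Y) = \int_{-\infty}^{\infty} y f_Y(y)\,dy, \tag{5.29} \]

where \(f_Y(y)\) is given by Theorem 4.3.1 as

\[ f_Y(y) = \begin{cases} f_X(g^{-1}(y)) \left| \dfrac{d}{dy} g^{-1}(y) \right| = \dfrac{f_X(x)}{|g'(x)|} & \text{if } y = g(x) \text{ for some } x \in \mathrm{Range}(X) \\ 0 & \text{otherwise.} \end{cases} \tag{5.30} \]

Thus, changing variables in Equation 5.29 from y = g(x) to \(x = g^{-1}(y)\), we get

\[ E(Y) = \int_{-\infty}^{\infty} g(x)\, \frac{f_X(x)}{|g'(x)|} \left| \frac{dy}{dx} \right| dx = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx, \tag{5.31} \]

as stated in Equation 5.23. □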


³ The proof would require taking limits of approximations of the given continuous r.v. by
discrete r.v.'s with a finite number of values.

Example 5.1.9 (Average Area of Circles). Assume that we draw a circle with a random
radius R, uniformly distributed between 0 and some constant a. What is the
expected value of the area \(Y = \pi R^2\) of such a circle?

Now,

\[ f_R(r) = \begin{cases} 1/a & \text{if } 0 < r < a \\ 0 & \text{otherwise} \end{cases} \tag{5.32} \]

and \(g(r) = \pi r^2\). Thus, substituting into Equation 5.23 gives

\[ E(Y) = \int_0^a \pi r^2\, \frac{1}{a}\,dr = \left. \frac{\pi}{3a}\, r^3 \right|_0^a = \frac{\pi a^2}{3}. \tag{5.33} \]

It is quite surprising that, though the mean radius is half of the maximal radius,
the mean area is one third of the maximal area.
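
The one-third result of Example 5.1.9 can be confirmed numerically by applying Equation 5.23 with a simple midpoint rule. The function name and the step count below are our own choices for this sketch.

```python
import math

def mean_area(a, steps=100_000):
    """Midpoint-rule approximation of the integral of pi * r**2 * (1/a) over (0, a)."""
    h = a / steps
    return sum(math.pi * ((k + 0.5) * h) ** 2 / a for k in range(steps)) * h

a = 2.0
max_area = math.pi * a ** 2
print(mean_area(a), max_area / 3, mean_area(a) / max_area)
```

The printed ratio is close to 1/3, whereas the area of the circle with the mean radius a/2 is only one quarter of the maximal area.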


Theorem 5.1.3 has a frequently used application to linear functions: 


Corollary 5.1.1 (Expectation of a Linear Function). For any random variable X 
such that E(X) exists and for any constants a and b, 


E(aX +b) =aE(X) +b. (5.34) 


Proof. We give the proof for continuous X only; for discrete X the proof is similar
and is left as an exercise.
In Equation 5.23, let g(x) = ax + b. Then

\[ E(aX + b) = \int_{-\infty}^{\infty} (ax + b) f(x)\,dx = a \int_{-\infty}^{\infty} x f(x)\,dx + b \int_{-\infty}^{\infty} f(x)\,dx = aE(X) + b. \tag{5.35} \]


Example 5.1.10 (Average Temperature). Assume that at noon on April 15th at a cer- 
tain place, the temperature C is a random variable (that is, it varies randomly from 
year to year) with an unknown distribution but with known mean E(C) = 15° Cel- 
sius. If F = 1.8C +32 is the corresponding temperature in Fahrenheit degrees, then, 
by Corollary 5.1.1, 


E(F) = 1.8E(C) + 32 = 59° Fahrenheit. (5.36) 


Thus, the expected temperature transforms in the same way as the individual 
values do. 


Example 5.1.11 (Expected Value of a Geometric Random Variable). Let X be
geometric with parameter p. (See Definition 4.1.7.) We can obtain E(X) by computing
E(X − 1) in two ways:

By Theorem 5.1.3 (the k = 1 term is zero, and we substitute j = k − 1),

\[ E(X - 1) = \sum_{k=2}^{\infty} (k - 1)\, p q^{k-1} = \sum_{j=1}^{\infty} j\, p q^{j} = q \sum_{j=1}^{\infty} j\, p q^{j-1} = qE(X) \tag{5.37} \]

and, by Corollary 5.1.1,

\[ E(X - 1) = E(X) - 1. \tag{5.38} \]

Thus,

\[ qE(X) = E(X) - 1, \tag{5.39} \]
\[ (1 - q)E(X) = 1, \tag{5.40} \]

and so

\[ E(X) = \frac{1}{p}. \tag{5.41} \]
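
The value E(X) = 1/p for a geometric r.v. can also be seen numerically from the partial sums of the defining series \(\sum_k k\,p\,q^{k-1}\). The sketch below is only a sanity check with floating-point arithmetic; the function name and the number of terms are our assumptions.

```python
def geometric_mean_partial(p, terms):
    """Partial sum of k * p * q**(k - 1) for k = 1, ..., terms, with q = 1 - p."""
    q = 1.0 - p
    return sum(k * p * q ** (k - 1) for k in range(1, terms + 1))

for p in (0.5, 0.25, 0.1):
    print(p, geometric_mean_partial(p, 10), geometric_mean_partial(p, 500), 1.0 / p)
```

The partial sums increase toward 1/p, more slowly when p is small.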


A theorem analogous to Theorem 5.1.3 holds for functions of several variables 
as well: 


Theorem 5.1.4 (Expectation of a Function of Several Random Variables). Let
\(X_1, X_2, \ldots, X_n\) be any random variables, and define a new random variable as
\(Y = g(X_1, X_2, \ldots, X_n)\), where g is any function on \(\mathbb{R}^n\). If f denotes the joint p.f.
or p.d.f. of \(X_1, X_2, \ldots, X_n\), then in the discrete case

\[ E(Y) = \sum \cdots \sum g(x_1, x_2, \ldots, x_n)\, f(x_1, x_2, \ldots, x_n), \tag{5.42} \]

where the summations run over all \(x_1, x_2, \ldots, x_n\) such that \(P(X_i = x_i) \neq 0\) for
i = 1, 2, ..., n, and in the continuous case

\[ E(Y) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, x_2, \ldots, x_n)\, f(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n, \tag{5.43} \]

provided that the sum and the integral are absolutely convergent.


We omit the proof. (In the discrete case it would be similar to the proof of Theo- 
rem 5.1.3 and in the continuous case it would present the same difficulties.) 


Example 5.1.12 (Expectation of the Distance of a Random Point from the Center of
a Circle).

Let the random point (X, Y) be uniformly distributed on \(D = \{(x, y) : x^2 + y^2 < 1\}\).
(See Example 4.4.7.) Let \(R = \sqrt{X^2 + Y^2}\) and find E(R).

Then

\[ E(R) = \frac{1}{\pi} \iint_D \sqrt{x^2 + y^2}\,dx\,dy. \tag{5.44} \]

Changing to polar coordinates, we get

\[ E(R) = \frac{1}{\pi} \int_0^{2\pi} \int_0^1 r^2\,dr\,d\theta = \frac{1}{\pi}\, 2\pi \left. \frac{r^3}{3} \right|_0^1 = \frac{2}{3}. \tag{5.45} \]

Theorem 5.1.4 has the following very important consequence:


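Theorem 5.1.5 (Expectation of a Sum of Two Random Variables). For any two
random variables X and Y whose expectations exist,

\[ E(X + Y) = E(X) + E(Y). \tag{5.46} \]


Proof. We give the proof for continuous (X, Y) only; for discrete (X, Y) the proof
is similar and is left as an exercise.
By Theorem 5.1.4, with \(X_1 = X\), \(X_2 = Y\) and g(X, Y) = X + Y, we have

\[
\begin{aligned}
E(X + Y) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x + y) f(x, y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} x \left( \int_{-\infty}^{\infty} f(x, y)\,dy \right) dx + \int_{-\infty}^{\infty} y \left( \int_{-\infty}^{\infty} f(x, y)\,dx \right) dy \\
&= \int_{-\infty}^{\infty} x f_X(x)\,dx + \int_{-\infty}^{\infty} y f_Y(y)\,dy = E(X) + E(Y).
\end{aligned}
\tag{5.47}
\]
□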
Repeated application of Theorem 5.1.5 and Equation 5.34 leads to 


Corollary 5.1.2 (Expectation of a Linear Function of Several Random Variables).
For any positive integer n and any random variables \(X_1, X_2, \ldots, X_n\) with
finite expectations, and constants \(a_1, a_2, \ldots, a_n\),

\[ E\left( \sum_{i=1}^{n} a_i X_i \right) = \sum_{i=1}^{n} a_i E(X_i). \tag{5.48} \]


Example 5.1.13 (Expectation of Binomial Random Variables). Recall that a random
variable X is called binomial with parameters n and p (see Definition 4.1.5) if it has
p.f.

\[ f(x; n, p) = \binom{n}{x} p^x q^{n-x} \quad \text{for } x = 0, 1, \ldots, n. \tag{5.49} \]

Now, X counts the number of successes in n trials (or the number of good items
selected in sampling with replacement). It can be written as a sum of n identical
(and independent; but that is irrelevant here) Bernoulli random variables \(X_i\) with
parameter p. Indeed, let \(X_i = 1\) if the ith trial results in success and 0 otherwise.
Then \(X = \sum_{i=1}^{n} X_i\), because the number of 1's in the sum is exactly the number of
successes and the rest of the terms equal 0. Hence

\[ E(X) = E\left( \sum_{i=1}^{n} X_i \right) = \sum_{i=1}^{n} E(X_i) = \sum_{i=1}^{n} p = np. \tag{5.50} \]

This result can, of course, be obtained directly from the definitions as well (see
Exercise 5.1.15), but the present method is much simpler and explains the reason 
behind the formula. 
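
As a numerical illustration of E(X) = np, the sum in Definition 5.1.1 can be evaluated directly from the binomial p.f. This does not replace the proof asked for in Exercise 5.1.15; the function name and the sample parameters are our own.

```python
from fractions import Fraction
from math import comb

def binomial_mean(n, p):
    """Sum of x * C(n, x) * p**x * q**(n - x) over x = 0, ..., n, with q = 1 - p."""
    q = 1 - p
    return sum(x * comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1))

n, p = 10, Fraction(3, 10)
print(binomial_mean(n, p), n * p)
```

With exact fractions both numbers agree exactly.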


Example 5.1.14 (Hypergeometric Random Variable). A hypergeometric random
variable X counts the number of successes, that is, the number of good items picked,
if we select a sample of size n without replacement from a mixture of N good and
bad items. (See Example 3.2.4.) If p stands for the fraction of good items in the lot
and q = 1 − p the fraction of bad items, then the p.f. of X is

\[ f(x; n, N, p) = \frac{\binom{Np}{x} \binom{Nq}{n-x}}{\binom{N}{n}} \quad \text{for } \max(0, n - Nq) \le x \le \min(n, Np). \tag{5.51} \]

A direct evaluation of E(X) would be quite difficult from here, but we can do
the same thing that we did in the binomial case. Again, if \(X_i\) is a Bernoulli random
variable for each i, such that \(X_i = 1\) if the ith trial (that is, the ith choice) results in
success and \(X_i = 0\) otherwise, then \(X = \sum_{i=1}^{n} X_i\). Now \(P(X_i = 1) = p\) for every i,
because if we do not know the outcomes of the previous choices, then the probability
of success on the ith trial is the same as for the first trial. Thus Equation 5.50 also
applies now and gives the same result: E(X) = np.
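
Although the direct sum is awkward by hand, a computer can evaluate it and confirm E(X) = np for particular lots. In the sketch below the lot is described by the number of good items (written `good`, equal to Np) rather than by the fraction p; this parametrization and the function names are our assumptions.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pf(x, n, N, good):
    """P(X = x): x good items in a sample of size n drawn without replacement."""
    return Fraction(comb(good, x) * comb(N - good, n - x), comb(N, n))

def hypergeometric_mean(n, N, good):
    """Direct evaluation of the sum of x * f(x) over the possible values of X."""
    lo, hi = max(0, n - (N - good)), min(n, good)
    return sum(x * hypergeometric_pf(x, n, N, good) for x in range(lo, hi + 1))

n, N, good = 5, 20, 8           # p = 8/20
print(hypergeometric_mean(n, N, good), n * Fraction(good, N))
```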


Theorem 5.1.6 (Expectation of the Product of Two Independent Random Vari- 
ables). For any two independent random variables X and Y whose expectations 
exist, 


E(XY) = E(X)E(Y). (5.52) 


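Proof. We give the proof for continuous (X, Y) only; for discrete (X, Y) we would
just have to replace the integrals by sums.

By the assumed independence, \(f(x, y) = f_X(x) f_Y(y)\). By Theorem 5.1.4, with
\(X_1 = X\), \(X_2 = Y\) and g(X, Y) = XY, we have

\[
\begin{aligned}
E(XY) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\, f(x, y)\,dx\,dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\, f_X(x) f_Y(y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} x f_X(x)\,dx \int_{-\infty}^{\infty} y f_Y(y)\,dy = E(X)E(Y).
\end{aligned}
\tag{5.53}
\]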


Note that in the preceding proof, the assumption of independence was crucial. 
For dependent random variables Equation 5.52 usually does not hold. 
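
A small discrete illustration of this remark, using fair dice: for two independent rolls the product rule holds, while for the fully dependent pair Y = X it fails. The setup and variable names are our own choices for this sketch.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
sixth = Fraction(1, 6)
mean = sum(i * sixth for i in faces)        # E(X) = E(Y) = 7/2

# Independent case: X and Y are two separate rolls, so the joint p.f. is 1/36.
e_xy_independent = sum(x * y * sixth * sixth for x, y in product(faces, faces))

# Dependent case: Y is the same roll as X, so XY = X**2.
e_xy_dependent = sum(x * x * sixth for x in faces)

print(e_xy_independent, mean * mean, e_xy_dependent)
```

The first two printed values coincide (49/4), while the third (91/6) is larger.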

A similar proof leads to the analogous theorem for more than two random vari- 
ables. 


Theorem 5.1.7 (Expectation of the Product of Several Independent Random
Variables). For any positive integer n and any independent random variables
\(X_1, X_2, \ldots, X_n\) whose expectations exist,

\[ E\left( \prod_{i=1}^{n} X_i \right) = \prod_{i=1}^{n} E(X_i). \tag{5.54} \]


Exercises


Exercise 5.1.1. From a regular deck of 52 playing cards we pick one at random. Let
the r.v. X equal the number on the card if it is a numbered one (ace counts as 1) and
10 if it is a face card. Find E(X).


Exercise 5.1.2. 4 indistinguishable balls are distributed randomly into 3 distinguish- 
able boxes. (See Example 2.5.3.) Let X denote the number of balls that end up in the 
first box. Find E(X). 


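Exercise 5.1.3. Find E(T) for a r.v. T with density

\[ f(t) = \begin{cases} 0 & \text{if } t < 0 \\ \lambda^2 t e^{-\lambda t} & \text{if } t \ge 0. \end{cases} \tag{5.55} \]

(This is the density of the sum of two independent exponential r.v.'s with parameter
\(\lambda > 0\).)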


Exercise 5.1.4. In the game of roulette (Example 5.1.6) a winning bet on any single 
number pays 35:1. Find E(X), where X denotes the gain from a bet of $1 on a single 
number. 


Exercise 5.1.5. Prove Equation 5.20. Hint: Let \(g(x) = \sum_{i=1}^{n-1} x^i = (x^n - x)/(x - 1)\).
First, compute \(g'(x)\) from both expressions for g(x) and set x = 2.


Exercise 5.1.6. A random variable X with p.d.f. \(f(x) = (1/\pi)[1/(1 + x^2)]\) for any
real x, is called a Cauchy r.v. Show that


1. this f is indeed a p.df., 
2. E(X) does not exist, because the integral of x f(x) is not absolutely convergent. 


Exercise 5.1.7. Prove Theorem 5.1.1 for discrete X. 


Exercise 5.1.8. Toss a fair coin repeatedly until HH or TT comes up. Let X be the
number of tosses required. Find E(X). (See Exercise 4.1.7 and Example 5.1.11.)


Exercise 5.1.9. Let X be an exponential r.v. with parameter \(\lambda\). (See Definition 4.2.3.)
Find \(E(X^2)\).


Exercise 5.1.10. Let X be uniform over the interval (0, 1). Find E(|X − (1/2)|).


Exercise 5.1.11. Let X be uniform over the interval (0, 1). Show that E(1/X) does 
not exist. 


Exercise 5.1.12. Prove Equation 5.34 for discrete X. 


Exercise 5.1.13. Prove Equation 5.34, for continuous X and a ≠ 0, directly from
Example 4.3.1 without using Theorem 5.1.3.


Exercise 5.1.14. Prove Theorem 5.1.5 for discrete X.


Exercise 5.1.15. Prove E(X) = np for a binomial r.v. directly from Equation 5.49 
and Definition 5.1.1. 


Exercise 5.1.16. Let the random point (X, Y) be uniformly distributed on D =
\(\{(x, y) : x^2 + y^2 < 1\}\) and let \(Z = X^2 + Y^2\). Find E(Z).


Exercise 5.1.17. Let the random point (X, Y) be uniformly distributed on D =
\(\{(x, y) : x^2 + y^2 < 1\}\). Does Equation 5.52 hold in this case?


Exercise 5.1.18. Let X be a discrete uniform r.v. on the set {−1, 0, 1}, and let
\(Y = X^2\). Show that X and Y are not independent but E(XY) = E(X)E(Y) nevertheless.


Exercise 5.1.19. Let the random point (X, Y) be uniformly distributed on the unit
square D = {(x, y) : 0 < x < 1, 0 < y < 1}, as in Example 4.4.4, and let
\(Z = X^2 + Y^2\). Find E(Z).


Exercise 5.1.20. Let the random point (X, Y) be uniformly distributed on the unit
square D = {(x, y) : 0 < x < 1, 0 < y < 1}, as in Example 4.4.4, and let
Z = X + Y. Find E(Z).


Exercise 5.1.21. Give an alternative proof for the expectation of a geometric r.v.
X (Example 5.1.11), based on the observation that

\[ E(X) = \sum_{k=1}^{\infty} k p q^{k-1} = p\, \frac{d}{dq} \sum_{k=0}^{\infty} q^k \quad \text{for } |q| < 1. \]


Exercise 5.1.22. Let X be a hypergeometric random variable (the number of good 
items in a sample, see Example 5.1.14), and let Y = n — X be the number of bad 
items in the same sample. Find E(X — Y). 


5.2 Variance and Standard Deviation 


As we have seen, the expected value gives some information about a distribution 
by providing a measure of its center. Another characteristic of a distribution is the 
standard deviation, which gives a measure of its average width. 

The first idea most people have for an average width of the distribution of a
random variable X is the mean of the deviations \(X - \mu\) from the mean \(\mu = E(X)\),
that is, the quantity \(E(X - \mu)\). Unfortunately, however, \(E(X - \mu) = E(X) - \mu = 0\)
for every r.v. that has an expectation, and so this is a useless definition. We must do
something to avoid the cancellations of the positive and negative deviations.

So next, one could try \(E(|X - \mu|)\). Though this definition does provide a good
measure of the average width, it is generally difficult to compute and does not have
the extremely useful properties and the amazingly fruitful applications that our next
definition has.

Definition 5.2.1 (Variance and Standard Deviation). Let X be any random variable
with mean \(\mu = E(X)\). We define its variance and standard deviation as

\[ \mathrm{Var}(X) = E((X - \mu)^2) \tag{5.56} \]
and
\[ \mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)}, \tag{5.57} \]

provided that \(E((X - \mu)^2)\) exists as a finite quantity.


Note that \((X - \mu)^2 \ge 0\), and so here the cancellations implicit in \(E(X - \mu)\) are
avoided. Moreover, squaring \(X - \mu\) introduces a change of units and the square root
in SD(X) undoes this. For instance, if X is a length, then Var(X) is area, but SD(X)
is length again.

SD(X) is often abbreviated as \(\sigma\) or \(\sigma_X\).


Example 5.2.1 (Roll of a Die). Let X denote the number obtained in a roll of a die,
that is, P(X = i) = 1/6 for i = 1, 2, ..., 6. Then \(\mu = 3.5\), and

\[
\begin{aligned}
\mathrm{Var}(X) &= \sum_{i=1}^{6} \frac{1}{6}\,(i - 3.5)^2 \\
&= \frac{1}{6}\left[ (-2.5)^2 + (-1.5)^2 + (-0.5)^2 + (0.5)^2 + (1.5)^2 + (2.5)^2 \right] \approx 2.9167
\end{aligned}
\tag{5.58}
\]
and
\[ \mathrm{SD}(X) \approx 1.7078. \tag{5.59} \]

In Figure 5.1 we see the graph of the p.f. with \(\mu\) and \(\mu \pm \sigma\) marked on the x-axis.
As can be seen, the distance between \(\mu - \sigma\) and \(\mu + \sigma\) is indeed a reasonable
measure of the average width of the graph.


Fig. 5.1. Graph of the probability function of a discrete uniform random variable over
{1, 2, ..., 6} with \(\mu\) and \(\mu \pm \sigma\) indicated.

Fig. 5.2. Graph of the p.d.f. of a uniform random variable over [0, 1] with \(\mu\) and \(\mu \pm \sigma\)
indicated.


Example 5.2.2 (Variance and Standard Deviation of a Uniform Random Variable).
Let X be uniform over the interval [a, b], that is, have p.d.f.

\[ f(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } a < x < b \\ 0 & \text{if } x \le a \text{ or } x \ge b. \end{cases} \tag{5.60} \]

Then \(\mu = (a + b)/2\) and

\[ \mathrm{Var}(X) = \frac{1}{b-a} \int_a^b (x - \mu)^2\,dx = \frac{(b - a)^2}{12} \tag{5.61} \]
and
\[ \mathrm{SD}(X) = \frac{b - a}{2\sqrt{3}}. \tag{5.62} \]

In Figure 5.2 we show the graph of the uniform p.d.f. over the [0, 1] interval,
with \(\mu\) and \(\mu \pm \sigma\) marked on the x-axis.


Next, we present several useful theorems. 


Theorem 5.2.1 (Zero Variance). For any random variable X such that Var(X)
exists, Var(X) = 0 if and only if P(X = c) = 1 for some constant c.


Proof. We give the proof for discrete X only.
If P(X = c) = 1 for some constant c, then f(x) = 0 for all x ≠ c, and so

\[ \mu = E(X) = \sum_{x :\, f(x) > 0} x f(x) = c \cdot 1 = c. \tag{5.63} \]

Similarly,

\[ \mathrm{Var}(X) = E\left( (X - c)^2 \right) = \sum_{x :\, f(x) > 0} (x - c)^2 f(x) = 0. \tag{5.64} \]

Conversely, assume that Var(X) = 0. Then every term on the left-hand side of

\[ \sum_{x :\, f(x) > 0} (x - \mu)^2 f(x) = 0 \tag{5.65} \]

is nonnegative and must therefore be 0. So, if f(x) > 0, then we must have \(x - \mu = 0\),
that is, \(x = \mu\). For \(x \neq \mu\), \((x - \mu)^2 \neq 0\), and so f(x) = 0 must hold. Since
\(\sum_{x :\, f(x) > 0} f(x) = 1\), and the only possible nonzero f(x) is \(f(\mu)\), we get \(f(\mu) = 1\),
or, in other words, \(P(X = \mu) = 1\). □


Theorem 5.2.2 (Variance and Standard Deviation of a Linear Function of a
Random Variable). If X is a random variable such that Var(X) exists, then, for
any constants a and b,

\[ \mathrm{Var}(aX + b) = a^2\, \mathrm{Var}(X) \tag{5.66} \]
and
\[ \mathrm{SD}(aX + b) = |a|\, \mathrm{SD}(X). \tag{5.67} \]

Proof. By Equation 5.34, \(E(aX + b) = a\mu + b\), and so

\[ \mathrm{Var}(aX + b) = E\left[ (aX + b - (a\mu + b))^2 \right] = E\left[ (aX - a\mu)^2 \right] = a^2 E\left[ (X - \mu)^2 \right] = a^2\, \mathrm{Var}(X). \tag{5.68} \]

Equation 5.67 follows from here by taking square roots. □


Example 5.2.3 (Standardization). In some applications, we transform random variables
to a standard scale in which all random variables are centered at 0 and have
standard deviations equal to 1. For any given r.v. X, for which \(\mu\) and \(\sigma\) exist, we
define its standardization as the new r.v.

\[ Z = \frac{X - \mu}{\sigma}. \tag{5.69} \]

Then indeed, by Equation 5.34,

\[ E(Z) = E\left( \frac{X - \mu}{\sigma} \right) = \frac{1}{\sigma} E(X) - \frac{\mu}{\sigma} = \frac{\mu}{\sigma} - \frac{\mu}{\sigma} = 0 \tag{5.70} \]

and, by Equation 5.67,

\[ \mathrm{SD}(Z) = \left| \frac{1}{\sigma} \right| \mathrm{SD}(X) = 1. \tag{5.71} \]

Theorem 5.2.3 (An Alternative Formula for Computing the Variance). If X is a
random variable such that Var(X) exists, then

\[ \mathrm{Var}(X) = E(X^2) - \mu^2. \tag{5.72} \]

Proof.

\[ \mathrm{Var}(X) = E\left( (X - \mu)^2 \right) = E(X^2 - 2\mu X + \mu^2) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - \mu^2. \tag{5.73} \]
□
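
The agreement between the defining formula for the variance and the shortcut \(E(X^2) - \mu^2\) can be checked on the fair die of Example 5.2.1. The sketch below uses exact fractions; the variable names are ours.

```python
from fractions import Fraction

pf = {i: Fraction(1, 6) for i in range(1, 7)}   # p.f. of a fair die

mu = sum(x * p for x, p in pf.items())
var_by_definition = sum((x - mu) ** 2 * p for x, p in pf.items())   # E((X - mu)^2)
second_moment = sum(x ** 2 * p for x, p in pf.items())              # E(X^2)
var_by_shortcut = second_moment - mu ** 2

print(var_by_definition, var_by_shortcut, float(var_by_shortcut))
```

Both routes give 35/12, which is the value 2.9167 quoted in Example 5.2.1.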


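Example 5.2.4 (Variance and Standard Deviation of an Exponential Random Variable).
Let T be an exponential r.v. with parameter \(\lambda\). We use Equation 5.72 to compute
the variance. Then

\[ E(T^2) = \int_{-\infty}^{\infty} t^2 f(t)\,dt = \int_0^{\infty} t^2 \lambda e^{-\lambda t}\,dt. \tag{5.74} \]

Integrating by parts twice as in Example 5.1.4, we obtain

\[ E(T^2) = \frac{2}{\lambda^2}. \tag{5.75} \]

Hence,

\[ \mathrm{Var}(T) = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2} \tag{5.76} \]

and so

\[ \mathrm{SD}(T) = \frac{1}{\lambda}. \tag{5.77} \]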


Theorem 5.2.4 (Variance of the Sum of Two Independent Random Variables).
For any two independent random variables X and Y whose variances exist,

\[ \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y). \tag{5.78} \]


Proof. Writing \(E(X) = \mu_X\) and \(E(Y) = \mu_Y\), we have \(E(X + Y) = \mu_X + \mu_Y\), and
so

\[
\begin{aligned}
\mathrm{Var}(X + Y) &= E\left( (X + Y - (\mu_X + \mu_Y))^2 \right) = E\left( ((X - \mu_X) + (Y - \mu_Y))^2 \right) \\
&= E\left( (X - \mu_X)^2 + 2(X - \mu_X)(Y - \mu_Y) + (Y - \mu_Y)^2 \right) \\
&= E\left( (X - \mu_X)^2 \right) + 2E\left( (X - \mu_X)(Y - \mu_Y) \right) + E\left( (Y - \mu_Y)^2 \right) \\
&= \mathrm{Var}(X) + 0 + \mathrm{Var}(Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).
\end{aligned}
\tag{5.79}
\]

The middle term is 0 because of the independence of X and Y and,
consequently, the independence of \(X - \mu_X\) and \(Y - \mu_Y\), which implies

\[ E((X - \mu_X)(Y - \mu_Y)) = E(X - \mu_X)\,E(Y - \mu_Y) = (E(X) - \mu_X)(E(Y) - \mu_Y) = 0. \tag{5.80} \]
□

Theorem 5.2.4 can easily be generalized to more than two random variables:
Theorem 5.2.4 can easily be generalized to more than two random variables: 


Theorem 5.2.5 (Variance of Sums of Pairwise Independent Random Variables). For any positive integer n and any pairwise independent random variables $X_1, X_2, \ldots, X_n$ whose variances exist,

$$\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \operatorname{Var}(X_i). \tag{5.81}$$

We omit the proof. It would be similar to that of Theorem 5.2.4, and because each mixed term involves the product of only two factors, we do not need to assume total independence; pairwise independence is enough.

It is this additivity of the variance that makes it, together with the SD, such a useful quantity; a property that other measures of the spread of a distribution, such as $E(|X-\mu|)$, lack.
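The contrast between the variance and the mean absolute deviation can be seen on a very small case. The sketch below assumes two independent fair coins (values 0 and 1), an illustrative choice that is not in the text, and computes both measures of spread exactly for one coin and for the sum of two.

```python
from fractions import Fraction
from itertools import product

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def mean_abs_dev(pmf):
    """E(|X - mu|), the mean absolute deviation."""
    mu = mean(pmf)
    return sum(abs(x - mu) * p for x, p in pmf.items())

def sum_of_independent(pmf_x, pmf_y):
    """p.f. of X + Y for independent X and Y (discrete convolution)."""
    out = {}
    for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
        out[x + y] = out.get(x + y, 0) + px * py
    return out

coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
total = sum_of_independent(coin, coin)

print(variance(total), 2 * variance(coin))          # 1/2 1/2
print(mean_abs_dev(total), 2 * mean_abs_dev(coin))  # 1/2 1
```

The variances add, while the mean absolute deviation of the sum (1/2) is not the sum of the individual ones (1).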

The preceding results have a corollary that is very important in statistical sampling:

Corollary 5.2.1 (Square Root Law). For any positive integer n, consider n pairwise independent, identically distributed random variables $X_1, X_2, \ldots, X_n$ with mean $\mu$ and standard deviation $\sigma$. Let $S_n$ denote their sum and $\bar{X}_n$ their average, that is, let

$$S_n = \sum_{i=1}^{n} X_i \tag{5.82}$$

and

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i. \tag{5.83}$$

Then

$$E(S_n) = n\mu \quad\text{and}\quad \operatorname{SD}(S_n) = \sqrt{n}\,\sigma, \tag{5.84}$$

and

$$E(\bar{X}_n) = \mu \quad\text{and}\quad \operatorname{SD}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}. \tag{5.85}$$

Example 5.2.5 (Variance and SD of a Bernoulli Random Variable). If X is a Bernoulli random variable with parameter p, then E(X) = p and

$$\operatorname{Var}(X) = E\big((X-p)^2\big) = p(1-p)^2 + (1-p)(0-p)^2 = p - p^2 = pq, \tag{5.86}$$

and

$$\operatorname{SD}(X) = \sqrt{pq}. \tag{5.87}$$

Example 5.2.6 (Variance and SD of a Binomial Random Variable). Again, as in Example 5.1.13, we write the binomial r.v. X with parameters n and p as a sum of n identically distributed and pairwise independent (this time, the independence is crucial) Bernoulli random variables $X_i$ with parameter p. Then $X = S_n = \sum_{i=1}^{n} X_i$, and so, by the square root law,

$$\operatorname{Var}(X) = n\operatorname{Var}(X_i) = npq, \tag{5.88}$$

$$\operatorname{SD}(X) = \sqrt{npq}, \tag{5.89}$$

and

$$\operatorname{SD}(\bar{X}_n) = \sqrt{\frac{pq}{n}}. \tag{5.90}$$
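The formulas of Example 5.2.6 can be confirmed directly from the binomial probability function, without using the square root law. This is a small sketch with the illustrative parameters n = 10 and p = 3/10 (assumed here, not from the text); the function name `binomial_pmf` is hypothetical.

```python
from fractions import Fraction
from math import comb, sqrt

def binomial_pmf(n, p):
    """Binomial probability function as a dict {x: P(X = x)}."""
    q = 1 - p
    return {x: comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)}

n, p = 10, Fraction(3, 10)  # illustrative parameters
q = 1 - p
pmf = binomial_pmf(n, p)

mu = sum(x * f for x, f in pmf.items())
var = sum((x - mu) ** 2 * f for x, f in pmf.items())

print(mu, var, n * p * q)               # 3 21/10 21/10
print(sqrt(var), sqrt(p * q / n))       # SD of X and SD of the relative frequency X/n
```

The exact variance computed from the definition agrees with npq, and the SD of the relative frequency shrinks like $1/\sqrt{n}$.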


There is another important general relation that we should mention here. It gives 
bounds for the probability of the tails of a distribution expressed in terms of multiples 
of the standard deviation. That such a relation exists should not be surprising, because
both quantities (standard deviation and tail probability) are measures of the width
of a distribution.


Theorem 5.2.6 (Chebyshev’s Inequality). For any random variable X with mean $\mu$ and variance $\sigma^2$ and any positive number k,

$$P(|X-\mu| \ge k\sigma) \le \frac{1}{k^2}. \tag{5.91}$$

Proof. Clearly,

$$P(|X-\mu| \ge k\sigma) = P\big((X-\mu)^2 \ge k^2\sigma^2\big) \tag{5.92}$$

and, applying Markov’s inequality (Theorem 5.1.2) to the nonnegative random variable $(X-\mu)^2$ with $a = k^2\sigma^2$, we get

$$P(|X-\mu| \ge k\sigma) \le \frac{E\big((X-\mu)^2\big)}{k^2\sigma^2} = \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}. \tag{5.93}$$
□

Theorem 5.2.6 should be used to estimate tail probabilities only if we do not know anything about a distribution. If we know the d.f., then we should use that to find a precise value for $P(|X-\mu| \ge k\sigma)$, which is usually much smaller than $1/k^2$. (See the example below.)

Example 5.2.7 (Tail Probabilities of an Exponential Random Variable). Let T be an exponential r.v. with parameter $\lambda = 1$. Then $\mu = \sigma = 1$ and $F(t) = 1 - e^{-t}$ for $t \ge 0$. Also,

$$P(|T-1| \ge k) = P(T-1 \ge k) = 1 - F(1+k) = e^{-1-k} \quad\text{for } k \ge 1. \tag{5.94}$$

Thus

$$P(|T-1| \ge k) \approx \begin{cases} 0.14 & \text{if } k = 1 \\ 0.05 & \text{if } k = 2 \\ 0.02 & \text{if } k = 3, \end{cases} \tag{5.95}$$

while Chebyshev’s inequality gives

$$P(|T-1| \ge k) \le \begin{cases} 1 & \text{if } k = 1 \\ 0.25 & \text{if } k = 2 \\ 0.11 & \text{if } k = 3. \end{cases} \tag{5.96}$$
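The two columns of numbers in Example 5.2.7 can be reproduced in a few lines. The sketch below simply evaluates $e^{-1-k}$ from Equation 5.94 and the bound $1/k^2$ from Equation 5.91; the function names are illustrative.

```python
from math import exp

def exact_tail(k):
    """P(|T - 1| >= k) for T exponential with parameter 1, valid for k >= 1."""
    return exp(-1 - k)

def chebyshev_bound(k):
    """Upper bound 1/k^2 from Chebyshev's inequality."""
    return 1 / k ** 2

for k in (1, 2, 3):
    print(k, round(exact_tail(k), 2), round(chebyshev_bound(k), 2))
# 1 0.14 1.0
# 2 0.05 0.25
# 3 0.02 0.11
```

For this distribution the bound overstates the exact tail probability by a factor of roughly 5 to 7.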


The most important use of Chebyshev’s inequality is in the proof of a limit theorem, known as the law of large numbers:⁴

Theorem 5.2.7 (Law of Large Numbers). For any positive integer n, let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with mean $\mu$ and standard deviation $\sigma$. Then, for any $\varepsilon > 0$, their mean $\bar{X}_n$ satisfies the relation

$$\lim_{n\to\infty} P(|\bar{X}_n - \mu| < \varepsilon) = 1. \tag{5.97}$$

Proof. By Corollary 5.2.1, for any i.i.d. $X_1, X_2, \ldots, X_n$ with mean $\mu$ and standard deviation $\sigma$, their average $\bar{X}_n$ has $E(\bar{X}_n) = \mu$ and $\operatorname{SD}(\bar{X}_n) = \sigma/\sqrt{n}$. Thus, applying Chebyshev’s inequality to $\bar{X}_n$ with $\varepsilon = k(\sigma/\sqrt{n})$, we obtain

$$P(|\bar{X}_n - \mu| \ge \varepsilon) = P\big(|\bar{X}_n - \mu| \ge k(\sigma/\sqrt{n})\big) \le \frac{1}{k^2} = \frac{\sigma^2}{n\varepsilon^2}.$$

Since $\sigma^2/(n\varepsilon^2) \to 0$ as $n \to \infty$, the left-hand side is squeezed to 0 as $n \to \infty$, and so $P(|\bar{X}_n - \mu| < \varepsilon) = 1 - P(|\bar{X}_n - \mu| \ge \varepsilon) \to 1$. □

Remarks.

1. The relation 5.97 is true even if $\sigma$ does not exist for the $X_i$.

2. In the special case of the $X_i$ being Bernoulli random variables with parameter p, the mean $\bar{X}_n$ is the relative frequency of successes in n trials and $\mu = p$. In that case, the law of large numbers says that, as $n \to \infty$, the relative frequency of successes will be arbitrarily close to the probability of success with probability approaching 1. (Note that this is only a probability statement about p. We cannot use this theorem as a definition of probability, that is, we cannot say that the relative frequency becomes the probability p; we can only make statements about the probability of this event, even in the stronger version in the footnote.)

3. The SD’s of $S_n$ and $\bar{X}_n$ are sometimes called their standard errors (SE).

⁴ In fact, there is a stronger version of this law: $P(\lim_{n\to\infty} \bar{X}_n = \mu) = 1$ under appropriate conditions, but we do not prove this strong law of large numbers here. Actually, the precise name of Theorem 5.2.7 is the weak law of large numbers.
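For Bernoulli trials, the probability in Equation 5.97 can be computed exactly from the binomial distribution and compared with the Chebyshev bound used in the proof. This is a sketch under assumed values p = 1/2 and ε = 1/20, chosen only for illustration; exact fractions are used so that the strict inequality is tested without rounding effects.

```python
from fractions import Fraction
from math import comb

def prob_within(n, p, eps):
    """Exact P(|X_bar_n - p| < eps) for n Bernoulli(p) trials."""
    q = 1 - p
    return sum(
        comb(n, x) * p ** x * q ** (n - x)
        for x in range(n + 1)
        if abs(Fraction(x, n) - p) < eps
    )

def chebyshev_lower_bound(n, p, eps):
    """1 - sigma^2 / (n eps^2), the bound from the proof of Theorem 5.2.7."""
    return 1 - p * (1 - p) / (n * eps ** 2)

p, eps = Fraction(1, 2), Fraction(1, 20)  # illustrative values
for n in (10, 100, 1000):
    exact = prob_within(n, p, eps)
    bound = chebyshev_lower_bound(n, p, eps)
    print(n, float(exact), float(bound))
```

The exact probabilities increase toward 1 as n grows, and the Chebyshev bound, although crude (it is even negative for small n), also tends to 1.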


Exercises 


Exercise 5.2.1. Find two random variables X and Y whose variances do not exist but 
the variance of their sum does. 


Exercise 5.2.2.

1. Let X and Y be two independent random variables whose variances exist. Show that Var(X − Y) = Var(X + Y) in this case.
2. Is the above relation necessarily true if X and Y are not independent?


Exercise 5.2.3. Let X and Y be two independent random variables whose variances 
exist. For any constants a, b, c, express Var(aX + bY + c) in terms of Var(X) and
Var(Y). 


Exercise 5.2.4. Show that the converse of Theorem 5.2.4 is false: For X and Y as in 
Exercise 5.1.18 the relation Var(X + Y) = Var(X) + Var(Y) holds, although X and 
Y are not independent. 


Exercise 5.2.5. Prove that if for a r.v. X, both $E(X) = \mu$ and $\operatorname{SD}(X) = \sigma$ exist and if c is any constant, then

1. $E((X-c)^2) = \sigma^2 + (\mu - c)^2$ and,
2. $\min_c E((X-c)^2) = \operatorname{Var}(X)$, that is, the mean of squared deviations is minimum if the deviations are taken from the mean.


Exercise 5.2.6. Let X and Y be two independent random variables, both with density
$f(x) = 3x^2$ for $x \in [0, 1]$ and 0 otherwise. Find the expected value and the variance


Exercise 5.2.7. Let X and Y be two independent exponential random variables, both with parameter $\lambda$. Find the expected value and the variance of

1. $X + 2Y$,
2. $X - 2Y$,
3. $XY$,
4. $X^2$,
5. $(X+Y)^2$.


Exercise 5.2.8. Let X be a binomial random variable with E(X) = 5. Find the least 
upper bound of SD(X) as a function of n. 


Exercise 5.2.9. Toss a fair coin n times, and let X denote the number of H’s and Y the number of T’s obtained. Does E(XY) = E(X)E(Y) hold in this case? (Hint: Compute E(XY) by computing $E((X+Y)^2)$ in two ways.)


5.3 Moments and Generating Functions 


The notions of expected value and variance of a r.v. X can be generalized to higher 
powers of X: 


Definition 5.3.1 (Moments). For any positive integer k, we call $E(X^k)$ the kth moment of X and $E((X-\mu)^k)$ the kth central moment of X, if they exist. (The name “moment” is borrowed from physics.)


Thus, E(X) is the first moment and Var(X) is the second central moment of X.
Other than these two, only the third and fourth central moments have some proba- 
bilistic significance: they can be used to measure the skewness and the flatness of a 
distribution. 

The use of moments is analogous to the use of higher derivatives in calculus. 
There, higher derivatives have no independent geometrical meaning, but are needed 
in Taylor expansions. Similarly, higher moments are significant only in the Taylor 
expansions of certain functions obtained from probability distributions: the moment 
generating function, the probability generating function, and the characteristic func- 
tion. 

The moment generating function is closely related to the Laplace transform, 
which may be familiar from differential equations courses, and has similar proper- 
ties. Its main use is the simplification it brings in finding the distributions of sums of 
i.i.d. random variables, which would, in most cases, be hopeless with the convolution 
formula when the number of terms gets large. 


Definition 5.3.2 (Moment Generating Function). The moment generating function (m.g.f.) $\psi$ or $\psi_X$ of any random variable X is defined by

$$\psi(t) = E(e^{tX}). \tag{5.98}$$

Clearly, the m.g.f. may not exist for certain random variables or for certain values of t. For most distributions that we are interested in, $\psi(t)$ will exist for all real t or on some interval.




Also, note that the m.g.f., being an expectation, depends only on the distribution
of X, and not on any other property of X. That is, if two r.v.’s have the same distribution,
then they have the same m.g.f. as well. For this reason, it is correct to speak
of the m.g.f. of a distribution rather than that of the corresponding r.v.


Example 5.3.1 (Binomial Distribution). If X is binomial with parameters n and p, then

$$\psi(t) = E(e^{tX}) = \sum_{x=0}^{n} \binom{n}{x} p^x q^{n-x} e^{tx} = \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x q^{n-x} = (pe^t + q)^n. \tag{5.99}$$

In particular, if n = 1, then X is Bernoulli and its m.g.f. is

$$\psi(t) = pe^t + q. \tag{5.100}$$


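Example 5.3.2 (Geometric Distribution). If X is geometric with parameter p, then

$$\psi(t) = E(e^{tX}) = \sum_{x=1}^{\infty} p q^{x-1} e^{tx} = \sum_{x=1}^{\infty} p (qe^t)^x q^{-1} = \frac{pe^t}{1 - qe^t} \quad\text{for } qe^t < 1. \tag{5.101}$$

Example 5.3.3 (Uniform Distribution). If X is uniform on [a, b], then

$$\psi(t) = \int_a^b \frac{e^{tx}}{b-a}\,dx = \left[\frac{e^{tx}}{(b-a)t}\right]_a^b = \frac{e^{bt} - e^{at}}{(b-a)t} \quad\text{for } t \neq 0. \tag{5.102}$$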


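Example 5.3.4 (Exponential Distribution). If X is exponential with parameter $\lambda > 0$, then

$$\psi(t) = \int_0^{\infty} e^{tx}\lambda e^{-\lambda x}\,dx = \lambda\int_0^{\infty} e^{(t-\lambda)x}\,dx = \lambda\left[\frac{e^{(t-\lambda)x}}{t-\lambda}\right]_0^{\infty} = \frac{\lambda}{\lambda - t} \quad\text{if } t < \lambda. \tag{5.103}$$

Clearly, $\psi(t)$ does not exist for $t \ge \lambda$.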
Let us see now how the m.g.f. and the moments are connected.

Theorem 5.3.1 ($\psi$ Generates Moments). If the m.g.f. $\psi$ of a random variable X exists for all t in a neighborhood of 0, then all the moments of X exist, and

$$\psi(t) = \sum_{k=0}^{\infty} E(X^k)\frac{t^k}{k!}, \tag{5.104}$$

that is, the moments are the coefficients of the Maclaurin series of $\psi$. Also, the function $\psi$ is then infinitely differentiable at 0 and

$$\psi^{(k)}(0) = E(X^k) \quad\text{for } k = 0, 1, 2, \ldots. \tag{5.105}$$

Proof. We omit the technical details and just outline the proof. Since the Maclaurin series of $e^{tX}$ is

$$e^{tX} = \sum_{k=0}^{\infty} \frac{(tX)^k}{k!}, \tag{5.106}$$

which is convergent for all real t, we have

$$\psi(t) = E(e^{tX}) = E\left(\sum_{k=0}^{\infty} \frac{(tX)^k}{k!}\right) = \sum_{k=0}^{\infty} E\left(\frac{(tX)^k}{k!}\right) = \sum_{k=0}^{\infty} E(X^k)\frac{t^k}{k!}. \tag{5.107}$$

Equation 5.105 follows from Equation 5.104 by differentiating both sides k times and setting t = 0. □
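Equation 5.105 can be illustrated numerically by approximating the derivatives of an m.g.f. at 0 with central differences. The sketch below uses the binomial m.g.f. of Equation 5.99 with the assumed parameters n = 10, p = 0.3 and an arbitrary step size h; it is a numerical approximation, not a symbolic differentiation.

```python
from math import exp

n, p = 10, 0.3  # illustrative binomial parameters
q = 1 - p

def psi(t):
    """m.g.f. of the binomial distribution, Equation 5.99."""
    return (p * exp(t) + q) ** n

h = 1e-3  # step size for the central differences (an arbitrary small value)
first = (psi(h) - psi(-h)) / (2 * h)               # approximates psi'(0)  = E(X)
second = (psi(h) - 2 * psi(0) + psi(-h)) / h ** 2  # approximates psi''(0) = E(X^2)

print(first, n * p)                       # both close to 3
print(second, n * p * q + (n * p) ** 2)   # both close to 11.1
```

The second value matches $E(X^2) = \operatorname{Var}(X) + [E(X)]^2 = npq + (np)^2$, in agreement with Equation 5.72.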


Example 5.3.5 (Mean and Variance of Exponential X). If X is exponential with parameter $\lambda > 0$, then expanding the m.g.f. from Example 5.3.4 into a geometric series we obtain

$$\psi(t) = \frac{\lambda}{\lambda - t} = \frac{1}{1 - t/\lambda} = \sum_{k=0}^{\infty} \frac{t^k}{\lambda^k} \quad\text{if } |t| < \lambda. \tag{5.108}$$

Comparing the coefficients of $t$ and $t^2$ in the sum here with those of Equation 5.104 results in

$$E(X) = \frac{1}{\lambda} \tag{5.109}$$

and

$$\frac{E(X^2)}{2} = \frac{1}{\lambda^2}. \tag{5.110}$$

Hence

$$\operatorname{Var}(X) = E(X^2) - [E(X)]^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}, \tag{5.111}$$

just as in Chapter 4.

We could, of course, also have obtained these results by using Equation 5.105 rather than Equation 5.104.

The most important properties of the m.g.f. are stated in the next three theorems.


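Theorem 5.3.2 (The Multiplicative Property of Moment Generating Functions). For any positive integer n, let $X_1, X_2, \ldots, X_n$ be independent random variables with m.g.f. $\psi_1, \psi_2, \ldots, \psi_n$, respectively, and let $Y = \sum_{i=1}^{n} X_i$. Then $\psi_Y(t)$ exists for all t for which each $\psi_i(t)$ exists, and

$$\psi_Y(t) = \prod_{i=1}^{n} \psi_i(t). \tag{5.112}$$

Proof.

$$\psi_Y(t) = E(e^{tY}) = E\left(e^{t\sum_{i=1}^{n} X_i}\right) = E\left(\prod_{i=1}^{n} e^{tX_i}\right) = \prod_{i=1}^{n} E\left(e^{tX_i}\right) = \prod_{i=1}^{n} \psi_i(t). \tag{5.113}$$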

The next two theorems will be stated without proof. Their proofs can be found in 
more advanced texts. 


Theorem 5.3.3 (Uniqueness of the Moment Generating Function). If the moment generating functions of two random variables are equal on a neighborhood of 0, then their distributions are also equal.


Theorem 5.3.4 (Limits of Sequences of Moment Generating Functions). Let $X_1, X_2, \ldots$ be a sequence of random variables with m.g.f.’s $\psi_1, \psi_2, \ldots$ and d.f.’s $F_1, F_2, \ldots$. If $\lim_{i\to\infty} \psi_i(t) = \psi(t)$ for all t in a neighborhood of 0, then

$$\lim_{i\to\infty} F_i(x) = F(x)$$

exists for all x and $\psi(t)$ is the m.g.f. of a r.v. whose d.f. is F.


Example 5.3.6 (Sum of Binomial Random Variables). We rederive the result of Example 4.5.6, using m.g.f.’s.

Let X and Y be independent, binomial r.v.’s with parameters $n_1, p$ and $n_2, p$, respectively. Then Z = X + Y is binomial with parameters $n_1 + n_2, p$.

By Example 5.3.1,

$$\psi_X(t) = (pe^t + q)^{n_1} \quad\text{and}\quad \psi_Y(t) = (pe^t + q)^{n_2}. \tag{5.114}$$

Hence, by Theorem 5.3.2, the m.g.f. of Z = X + Y is given by

$$\psi_Z(t) = (pe^t + q)^{n_1 + n_2}. \tag{5.115}$$

This function is the m.g.f. of a binomial r.v. with parameters $n_1 + n_2, p$, and so, by the uniqueness theorem, Z = X + Y is binomial with parameters $n_1 + n_2, p$.

Equation 5.104 is a particular case of the generating function of a sequence. In general, for any sequence $a_0, a_1, \ldots$, we call $G(s) = \sum_{k=0}^{\infty} a_k s^k$ its generating function. (As known from calculus, the infinite sum here is convergent on a finite or infinite interval centered at 0, or just at the point 0 itself.) Thus the m.g.f. is the generating function of the sequence $(E(X^k)/k!)$ and not of the sequence of moments, despite its name.

For a discrete random variable, the probability function provides another se- 
quence, in addition to the moments. The corresponding generating function for non- 
negative, integer-valued random variables plays an important role in many applica- 
tions. 


Definition 5.3.3 (Probability Generating Function). The probability generating function (p.g.f.) $G$ or $G_X$ of any nonnegative integer-valued random variable X is defined by

$$G(s) = E(s^X) = \sum_{x=0}^{\infty} f(x)s^x, \tag{5.116}$$

where f is the p.f. of X.

If we put s = 1 in Equation 5.116, then the sum on the right-hand side becomes the sum of the probabilities, and so we obtain

$$G(1) = 1. \tag{5.117}$$

Hence, the power series in Equation 5.116 is convergent for all $|s| \le 1$.

If we know the generating function G, then we can obtain the probability function f from Equation 5.116, either by expanding G(s) into a power series and extracting the coefficients, or by using the formula

$$f(k) = \frac{G^{(k)}(0)}{k!} \quad\text{for } k = 0, 1, \ldots. \tag{5.118}$$

The p.g.f. is closely related to the m.g.f. If we let $s = e^t$ in Equation 5.116, then we obtain

$$\psi(t) = G(e^t). \tag{5.119}$$


Thus, the p.g.f. has properties similar to those of the m.g.f. and, specifically, it has the corresponding multiplicative and uniqueness properties. The p.g.f. is, however, defined only for nonnegative integer-valued random variables, whereas the m.g.f. exists for all random variables whose moments exist. The p.g.f. is used to derive certain specific distributions, mainly in problems involving difference equations, such as the gambler’s ruin problem (Example 3.5.5), which we shall revisit below, and the m.g.f. is used to derive general theorems like the CLT in Section 6.3.

Example 5.3.7 (The Gambler’s Ruin). In Example 3.5.5 we asserted that the difference equation

$$P(A_m) = P(A_{m+1}) \cdot \frac{1}{2} + P(A_{m-1}) \cdot \frac{1}{2} \tag{5.120}$$

is known to have the general solution $P(A_m) = a + bm$, where a and b are arbitrary constants. While it is easy to see by direct substitution that $P(A_m) = a + bm$ is a solution, it is not obvious that there are no other solutions. We now use the p.g.f. to prove this fact.

Multiplying both sides of Equation 5.120 by $s^m$ and summing over m from 1 to $\infty$, we get

$$\sum_{m=1}^{\infty} P(A_m)s^m = \frac{1}{2s}\sum_{m=1}^{\infty} P(A_{m+1})s^{m+1} + \frac{s}{2}\sum_{m=1}^{\infty} P(A_{m-1})s^{m-1}. \tag{5.121}$$

With the notations $p_m = P(A_m)$ and $G(s) = \sum_{m=0}^{\infty} p_m s^m$, the above equation can be written as

$$G(s) - p_0 = \frac{1}{2s}\,[G(s) - p_1 s - p_0] + \frac{s}{2}\,G(s), \tag{5.122}$$

and, solving for G(s), we obtain

$$G(s) = \frac{(p_1 - 2p_0)s + p_0}{1 - 2s + s^2}. \tag{5.123}$$

As known from calculus, the expression on the right-hand side can be decomposed into partial fractions as

$$G(s) = \frac{a}{1-s} + \frac{bs}{(1-s)^2} \tag{5.124}$$

with appropriate constants a and b. These partial fractions are well-known sums of a geometric series and of one derived from a geometric series,⁵ and so

$$G(s) = \sum_{m=0}^{\infty} a s^m + \sum_{m=0}^{\infty} b m s^m = \sum_{m=0}^{\infty} (a + bm)s^m. \tag{5.125}$$

Comparing this result with the definition of G(s), we can see that $p_m = a + bm$ must hold for all m. In particular, $p_0 = a$ and $p_1 = a + b$, and so $a = p_0$ and $b = p_1 - p_0$.

⁵ The second sum can be derived by differentiation from the geometric sum:

$$\sum_{m=0}^{\infty} m s^m = s\sum_{m=0}^{\infty} m s^{m-1} = s\frac{d}{ds}\sum_{m=0}^{\infty} s^m = s\frac{d}{ds}\frac{1}{1-s} = \frac{s}{(1-s)^2}.$$
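The conclusion of Example 5.3.7 can be cross-checked by computing the first few $p_m$ in three ways: by iterating the difference equation, by reading off the power series coefficients of Equation 5.123, and from the closed form $p_0 + (p_1 - p_0)m$. The starting values $p_0 = 1$ and $p_1 = 9/10$ below are arbitrary illustrative choices, and the function names are hypothetical.

```python
from fractions import Fraction

def by_recurrence(p0, p1, count):
    """Iterate p_{m+1} = 2 p_m - p_{m-1}, which is Equation 5.120 solved for p_{m+1}."""
    p = [p0, p1]
    while len(p) < count:
        p.append(2 * p[-1] - p[-2])
    return p[:count]

def by_series(p0, p1, count):
    """Coefficients of ((p1 - 2 p0) s + p0) / (1 - s)^2.

    Uses 1/(1-s)^2 = sum (m+1) s^m, so s/(1-s)^2 = sum m s^m.
    """
    return [p0 * (m + 1) + (p1 - 2 * p0) * m for m in range(count)]

p0, p1 = Fraction(1), Fraction(9, 10)  # arbitrary starting values
count = 11
linear = [p0 + (p1 - p0) * m for m in range(count)]

print(by_recurrence(p0, p1, count) == linear)  # True
print(by_series(p0, p1, count) == linear)      # True
```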


The moments can also be obtained directly from the p.g.f. For instance,

$$G'(s) = \sum_{x=0}^{\infty} f(x)\,x s^{x-1} \tag{5.126}$$

and so⁶

$$G'(1) = \sum_{x=0}^{\infty} f(x)\,x = E(X). \tag{5.127}$$

Similarly,

$$G''(s) = \sum_{x=0}^{\infty} f(x)\,x(x-1)s^{x-2}, \tag{5.128}$$

and

$$G''(1) = \sum_{x=0}^{\infty} f(x)\,x(x-1) = E(X^2) - E(X). \tag{5.129}$$

Hence

$$E(X^2) = G''(1) + G'(1), \tag{5.130}$$

and

$$\operatorname{Var}(X) = G''(1) + G'(1) - G'(1)^2. \tag{5.131}$$
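When the p.g.f. is a polynomial, both the coefficient extraction of Equation 5.116 and the moment formulas 5.127 to 5.131 reduce to simple list operations. The sketch below assumes the sum of two independent fair dice as the example (not taken from the text), represents each p.g.f. as a list of coefficients, and uses the multiplicative property to obtain the p.g.f. of the sum; the helper names are illustrative.

```python
from fractions import Fraction

def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = power of s)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def derivative_at_one(coeffs, order):
    """G^{(order)}(1) for a polynomial p.g.f., computed term by term."""
    total = Fraction(0)
    for x, c in enumerate(coeffs):
        falling = 1
        for r in range(order):
            falling *= x - r  # x (x-1) ... (x-order+1)
        total += c * falling
    return total

die = [Fraction(0)] + [Fraction(1, 6)] * 6  # G(s) = (s + ... + s^6) / 6
two_dice = poly_mul(die, die)               # p.g.f. of the sum of two dice

g1 = derivative_at_one(two_dice, 1)
g2 = derivative_at_one(two_dice, 2)
print(two_dice[7])             # P(sum = 7) = 1/6
print(g1, g2 + g1 - g1 ** 2)   # 7 35/6
```

The coefficient of $s^7$ is the probability of a total of 7, and the last line gives the mean and the variance of the total via Equations 5.127 and 5.131.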


As mentioned at the beginning of this section, there is yet another widely used 
function related to the generating functions described above: 


Definition 5.3.4 (Characteristic Function). The characteristic function $\phi$ or $\phi_X$ of any random variable X is defined by

$$\phi(t) = E(e^{itX}). \tag{5.132}$$

This function has properties similar to those of the m.g.f., and has the advantage that, unlike the m.g.f., it exists for every random variable X since $e^{itx}$ is a bounded function. On the other hand, its use requires complex analysis, and therefore we shall not discuss it further.

⁶ Since the power series of G(s) may not be convergent for s > 1, we consider $G'(1)$ and $G''(1)$ to be left derivatives.


Exercises


Exercise 5.3.1. Show that, for independent random variables, the third central moments are additive. That is, writing $m_3(X) = E((X-\mu_X)^3)$, we have, for independent X and Y, $m_3(X+Y) = m_3(X) + m_3(Y)$.


Exercise 5.3.2. Show that, for independent random variables, the fourth central moments are not additive. That is, writing $m_4(X) = E((X-\mu_X)^4)$, for independent X and Y, $m_4(X+Y) \neq m_4(X) + m_4(Y)$ in general.


Exercise 5.3.3. Express the m.g.f. $\psi_Y$ of Y = aX + b in terms of $\psi_X$.


Exercise 5.3.4. Use the m.g.f. from Example 5.3.2 to show that for a geometric r.v. $\operatorname{Var}(X) = q/p^2$.


Exercise 5.3.5. For any random variable X, the function $\psi_{X-\mu}$ is called the central moment generating function of X. Find $\psi_{X-\mu}$ for an X having the binomial n, p distribution, and use $\psi_{X-\mu}$ to find Var(X).


Exercise 5.3.6. Find the m.g.f. and the p.g.f. of a discrete uniform r.v. with possible values 1, 2, ..., n. Simplify your answers.


Exercise 5.3.7. Let X and Y be i.i.d. random variables with m.g.f. $\psi$. Express the m.g.f. $\psi_Z$ of Z = Y − X in terms of $\psi$.


Exercise 5.3.8. Let X be a continuous r.v. with density $f(x) = (1/2)e^{-|x|}$ for $-\infty < x < \infty$.

1. Show that $\psi(t) = \dfrac{1}{1-t^2}$ for $|t| < 1$.
2. Use this $\psi$ to find a formula for the moments of X.


Exercise 5.3.9. Find the p.g.f. of a binomial n, p random variable.

Exercise 5.3.10. Find the p.g.f. of a geometric random variable with parameter p.


Exercise 5.3.11. We roll three dice. Use the p.g.f. to find the probability $p_k$ that the sum of the points showing is k for k = 3, 4, and 5. (Hint: Cf. Exercise 5.3.6.)


5.4 Covariance and Correlation 


The expected value and the variance provided useful summary information about 
single random variables. The new notions of covariance and correlation, to be intro- 
duced in this section, provide information about the relationship between two random 
variables. 


Definition 5.4.1 (Covariance). Given random variables X and Y with expected values $\mu_X$ and $\mu_Y$, their covariance is defined as

$$\operatorname{Cov}(X, Y) = E\big((X-\mu_X)(Y-\mu_Y)\big), \tag{5.133}$$

whenever the expected value on the right-hand side exists.

Example 5.4.1 (Covariance of (X, Y) Uniform on a Triangle). Let (X, Y) be uniform on the triangle $D = \{(x, y) : 0 \le x \le y \le 1\}$. Then f(x, y) = 2 on D, and

$$\mu_X = \int_0^1\!\!\int_0^y 2x\,dx\,dy = \int_0^1 y^2\,dy = \frac{1}{3}, \tag{5.134}$$

$$\mu_Y = \int_0^1\!\!\int_0^y 2y\,dx\,dy = \int_0^1 2y^2\,dy = \frac{2}{3}, \tag{5.135}$$

and

$$\operatorname{Cov}(X, Y) = \int_0^1\!\!\int_0^y 2\left(x - \frac{1}{3}\right)\left(y - \frac{2}{3}\right)dx\,dy = \int_0^1 \left(y^2 - \frac{2}{3}y\right)\left(y - \frac{2}{3}\right)dy = \frac{1}{36}. \tag{5.136}$$

We can see from the definition that the covariance is positive if $(X-\mu_X)$ and $(Y-\mu_Y)$ tend to have the same sign, as in the example above, and it is negative if they tend to have opposite signs. If the sign combinations are equally balanced, then Cov(X, Y) = 0. The latter happens, in particular, whenever X and Y are independent, but it can happen in other cases, too.


Theorem 5.4.1 (An Alternative Formula for the Covariance). If X and Y are random variables such that E(X), E(Y), and E(XY) exist, then

$$\operatorname{Cov}(X, Y) = E(XY) - E(X)E(Y). \tag{5.137}$$

Proof. From Definition 5.4.1,

$$
\begin{aligned}
\operatorname{Cov}(X, Y) &= E(XY - \mu_X Y - \mu_Y X + \mu_X\mu_Y) \\
&= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X\mu_Y \\
&= E(XY) - E(X)E(Y).
\end{aligned}
\tag{5.138}
$$


Theorem 5.4.2 (Independence Implies Zero Covariance). For independent random variables X and Y whose expectations exist, Cov(X, Y) = 0.


Proof. By Theorem 5.1.6 the two terms on the right-hand side of Equation 5.137 are equal in this case. □


As mentioned above, the converse is not true; the covariance may be zero for 
dependent random variables as well, as shown by the next example. 


Example 5.4.2 (Covariance of (X, Y) Uniform on a Disc). Let (X, Y) be uniform on the unit disc $D = \{(x, y) : x^2 + y^2 < 1\}$. Then, clearly, $\mu_X = \mu_Y = 0$ and

$$\operatorname{Cov}(X, Y) = \int_{-1}^{1}\!\int_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}} \frac{1}{\pi}xy\,dy\,dx = \int_{-1}^{1} \frac{1}{\pi}x\left[\frac{y^2}{2}\right]_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}}dx = 0. \tag{5.139}$$

In order to shed more light on what the covariance measures, it is useful to standardize the variables, so that the magnitude of the variables should not influence the value obtained. Thus we make a new definition:

Definition 5.4.2 (Correlation Coefficient). We define the correlation coefficient of any random variables X and Y with nonzero variances and existing covariance as

$$\rho(X, Y) = E\left(\frac{X-\mu_X}{\sigma_X} \cdot \frac{Y-\mu_Y}{\sigma_Y}\right). \tag{5.140}$$

We have the following obvious theorem:

Theorem 5.4.3 (Alternative Formulas for the Correlation Coefficient). If $\rho(X, Y)$ exists, then

$$\rho(X, Y) = \frac{E\big((X-\mu_X)(Y-\mu_Y)\big)}{\sigma_X\sigma_Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sigma_X\sigma_Y}. \tag{5.141}$$

We are going to show that $\rho(X, Y)$ falls between −1 and +1, with $\rho$ taking on the values ±1 if and only if there is a linear relation Y = aX + b between X and Y with probability 1. ($\rho$ is +1 if a is positive and −1 if a is negative.) Thus, $|\rho|$ measures how close the points (X, Y) fall to a straight line in the plane. If $\rho$ is 0, then X and Y are said to be uncorrelated, which means that there is no association around a line between X and Y. We say that $\rho(X, Y)$ measures the strength of the linear association between X and Y.

To prove the previous statements, we first present a general theorem about ex- 
pectations. 


Theorem 5.4.4 (Schwarz Inequality). For any random variables X and Y such that the expectations below exist,

$$[E(XY)]^2 \le E(X^2)E(Y^2). \tag{5.142}$$

Furthermore, the two sides are equal if and only if P(aX + bY = 0) = 1 for some constants a and b, not both 0.

Proof. First assume that $E(Y^2) > 0$. Then, for any real number $\lambda$,

$$0 \le E\big((X - \lambda Y)^2\big) = \lambda^2E(Y^2) - 2\lambda E(XY) + E(X^2), \tag{5.143}$$

and the right-hand side is a quadratic function of $\lambda$ whose graph is a parabola facing upwards. The minimum occurs at

$$\lambda = \frac{E(XY)}{E(Y^2)} \tag{5.144}$$

and at that point the inequality 5.143 becomes

$$
\begin{aligned}
0 &\le \left(\frac{E(XY)}{E(Y^2)}\right)^2E(Y^2) - 2\,\frac{E(XY)}{E(Y^2)}\,E(XY) + E(X^2) \\
&= \frac{[E(XY)]^2}{E(Y^2)} - 2\,\frac{[E(XY)]^2}{E(Y^2)} + E(X^2) \\
&= E(X^2) - \frac{[E(XY)]^2}{E(Y^2)},
\end{aligned}
\tag{5.145}
$$

which is equivalent to the inequality 5.142.

In the case $E(Y^2) = 0$, by Theorem 5.2.1, P(Y = 0) = 1, and then we also have that P(XY = 0) = 1 and E(XY) = 0. Thus, inequality 5.142 is valid with both sides equal to 0.

To prove the second statement of the theorem, first assume that P(aX + bY = 0) = 1 for some constants a and b, not both 0. If a = 0, then this condition reduces to P(Y = 0) = 1, which we have just discussed. If $a \neq 0$, then, by Theorem 5.2.1, $E((aX + bY)^2) = 0$, and so

$$a^2E(X^2) + 2abE(XY) + b^2E(Y^2) = 0, \tag{5.146}$$

or, equivalently,

$$\left(\frac{b}{a}\right)^2E(Y^2) + 2\,\frac{b}{a}\,E(XY) + E(X^2) = 0. \tag{5.147}$$

Now, this is a quadratic equation for b/a and we know that it has a single solution. (If it had two solutions, then both X and Y would have to be 0 with probability 1: a trivial case.) Thus its discriminant must be zero, that is, we must have

$$(2E(XY))^2 - 4E(X^2)E(Y^2) = 0, \tag{5.148}$$

which reduces to

$$[E(XY)]^2 = E(X^2)E(Y^2). \tag{5.149}$$

If we assume Equation 5.149, then the last argument can be traced backwards and we can conclude that P(aX + bY = 0) = 1 must hold for some constants a and b, not both 0. □


If we apply Theorem 5.4.4 to $(X-\mu_X)/\sigma_X$ and $(Y-\mu_Y)/\sigma_Y$ in place of X and Y, we obtain the following relation for the correlation coefficient:

Corollary 5.4.1. For any random variables X and Y such that $\rho(X, Y)$ exists,

$$-1 \le \rho(X, Y) \le 1. \tag{5.150}$$

Furthermore, $\rho(X, Y) = \pm 1$ if and only if P(Y = aX + b) = 1 for some constants $a \neq 0$ and b, with $\operatorname{sign}(\rho(X, Y)) = \operatorname{sign}(a)$.

Thus, the correlation coefficient gives a numerical value for the strength of the linear association between X and Y, that is, the closer $\rho$ is to ±1, the closer the random points (X, Y) bunch around a straight line, and vice versa. Note that the line cannot be vertical or horizontal, because then $\rho$ would not exist. The correlation coefficient conveys no useful information if the points bunch around any curve other than a straight line. For example, if (X, Y) is uniform on a circle, then $\rho$ is zero, even though the points are on a curve.


Table 5.1.

Student |  X |  Y |  X^2 |  Y^2 |   XY
A       | 40 | 50 | 1600 | 2500 | 2000
B       | 60 | 55 | 3600 | 3025 | 3300
C       | 80 | 75 | 6400 | 5625 | 6000
D       | 90 | 80 | 8100 | 6400 | 7200
E       | 80 | 90 | 6400 | 8100 | 7200
Ave.    | 70 | 70 | 5220 | 5130 | 5140


Example 5.4.3 (Correlation Between Two Exams). Suppose five students take two exams. Let X and Y denote the grades of a randomly selected student, as given in the X and Y columns of Table 5.1. The rest of the table is included for the computation of $\rho$.

Hence $\mu_X = \mu_Y = 70$, $\sigma_X = \sqrt{5220 - 70^2} \approx 17.889$, $\sigma_Y = \sqrt{5130 - 70^2} \approx 15.166$, and $\rho \approx (5140 - 70^2)/(17.889 \cdot 15.166) \approx 0.88$.

The grades of each student are shown below as points in a so-called scatter plot, 
together with the line of best fit in the least squares sense, or briefly, the least squares 
line or regression line, given by y = 70 + (3/4)(x — 70). (The general formula will 
be given below in Theorem 5.4.5, and regression will be discussed in a later chapter.) 
It should not be surprising that the points bunch around a straight line, because we 
would expect good students to do well on both exams, bad students to do poorly, 
and mediocre students to be in the middle, both times. On the other hand, the points 
do not need to fall exactly on a line, since there is usually some randomness in the 
scores; people do not always perform at the same level. Furthermore, the slope of the 
line does not have to be 1, because the two exams may differ in difficulty. 

The value 0.88 for $\rho$ shows that the points are fairly close to a line (see Fig. 5.3). If $\rho$ were 1, then they would all fall on a line, and if $\rho$ were 0, then the points would seem to bunch the same way around any line through their center of gravity, with no preferred direction.


Example 5.4.4 (Correlation of (X, Y) Uniform on a Triangle). Let (X, Y) be uniform on the triangle $D = \{(x, y) : 0 \le x \le y \le 1\}$ as in Example 5.4.1. Then

$$E(X^2) = \int_0^1\!\!\int_0^y 2x^2\,dx\,dy = \int_0^1 \frac{2}{3}y^3\,dy = \frac{1}{6}, \tag{5.151}$$

$$E(Y^2) = \int_0^1\!\!\int_0^y 2y^2\,dx\,dy = \int_0^1 2y^3\,dy = \frac{1}{2}. \tag{5.152}$$

Hence,

$$\sigma_X^2 = \frac{1}{6} - \left(\frac{1}{3}\right)^2 = \frac{1}{18}, \quad\text{and}\quad \sigma_Y^2 = \frac{1}{2} - \left(\frac{2}{3}\right)^2 = \frac{1}{18}. \tag{5.153}$$

Thus,

$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\sigma_Y} = \frac{1/36}{1/18} = \frac{1}{2}. \tag{5.154}$$

This result shows that the points of the triangle D are rather loosely grouped around a line, as can also be seen in Figure 5.4. However, this line is not unique: the line joining the origin and the centroid would do just as well as the one shown.

Fig. 5.4. The triangle D with the least squares line and the point of averages drawn in.

In addition to being a measure of the linear association between two random variables, the correlation coefficient is also a determining factor in the slope of the least squares line (which we give here for the special case of a finite number of equiprobable points):


Theorem 5.4.5 (Least Squares Line). Let (X, Y) be a random point with m possible values $(x_i, y_i)$, each having probability 1/m. The line y = ax + b, such that the sum of the squared vertical distances

$$f(a, b) = \sum_{i=1}^{m} (ax_i + b - y_i)^2 \tag{5.155}$$

from the points to it is minimum, is given by the equation

$$y = \rho\,\frac{\sigma_Y}{\sigma_X}\,(x - \mu_X) + \mu_Y, \tag{5.156}$$

or, equivalently, in standardized form by

$$\frac{y - \mu_Y}{\sigma_Y} = \rho\,\frac{x - \mu_X}{\sigma_X}. \tag{5.157}$$

The proof is left as Exercise 5.4.6.
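As a numerical illustration of Theorem 5.4.5 (not a proof), the sketch below applies Equation 5.156 to the five grade pairs of Table 5.1 and compares the value of f(a, b) on the resulting line with its value on a few nearby lines. The perturbation sizes are arbitrary, and the variable names are illustrative.

```python
from math import sqrt

# Grades from Table 5.1 (students A to E).
xs = [40, 60, 80, 90, 80]
ys = [50, 55, 75, 80, 90]
m = len(xs)

mu_x = sum(xs) / m
mu_y = sum(ys) / m
sigma_x = sqrt(sum(x * x for x in xs) / m - mu_x ** 2)
sigma_y = sqrt(sum(y * y for y in ys) / m - mu_y ** 2)
cov = sum(x * y for x, y in zip(xs, ys)) / m - mu_x * mu_y
rho = cov / (sigma_x * sigma_y)

slope = rho * sigma_y / sigma_x   # coefficient of x in Equation 5.156
intercept = mu_y - slope * mu_x

def f(a, b):
    """Sum of squared vertical distances, Equation 5.155."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

print(round(rho, 2), round(slope, 2), round(intercept, 2))  # 0.88 0.75 17.5

best = f(slope, intercept)
# Nearby lines (arbitrary small perturbations) should not do better.
for da, db in [(0.01, 0), (-0.01, 0), (0, 0.5), (0, -0.5), (0.01, -0.5)]:
    assert f(slope + da, intercept + db) >= best
```

The computed line is y = 17.5 + 0.75x, which is the same as y = 70 + (3/4)(x − 70) quoted in Example 5.4.3.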
Exercises 
Exercise 5.4.1. Prove that 
Var(X + Y) = Var(X) + 2Cov(X, Y) + Var(Y) (5.158) 


whenever each term exists. 


Exercise 5.4.2. Let (X, Y) be uniform on the triangle $D = \{(x, y) : 0 \le x,\ 0 \le y,\ x + y \le 1\}$. Compute Cov(X, Y) and $\rho(X, Y)$.


Exercise 5.4.3. Let X and Y have the same distribution and let U = X + Y and V = X − Y.

1. Show that Cov(U, V) = 0, assuming that each variance and covariance exists.
2. Show that if X and Y denote the outcomes of throwing two dice, then U and V are not independent, although Cov(U, V) = 0 by Part 1.


Exercise 5.4.4. Let (X, Y) be uniform on the half disc $D = \{(x, y) : 0 \le y,\ x^2 + y^2 \le 1\}$. Compute Cov(X, Y) and $\rho(X, Y)$.


Exercise 5.4.5. Let X and Y be discrete random variables with joint probabilities $P(X = x_i, Y = y_j) = p_{ij}$ for i = 1, 2, ..., m and j = 1, 2, ..., n. Also using $p_i$ for $P(X = x_i)$ and $q_j$ for $P(Y = y_j)$, write a formula for

1. Cov(X, Y) and,
2. $\rho(X, Y)$.




Exercise 5.4.6. Prove Theorem 5.4.5. (Hint: Set the partial derivatives of f(a, b) in 
Equation 5.155 equal to zero and solve for a and b.) 


Exercise 5.4.7. Let X and Y be random variables such that $\rho(X, Y)$ exists and let U = aX + b and V = cY + d with $a \neq 0$, b, $c \neq 0$, d constants. Show that $\rho(U, V) = \operatorname{sign}(ac)\,\rho(X, Y)$.


Exercise 5.4.8. Let X and Y be random variables such that Var(X), Var(Y) and Cov(X, Y) exist, and let U = aX + bY and V = cX + dY with a, b, c, d constants. Find an expression for Cov(U, V) in terms of a, b, c, d, Var(X), Var(Y) and Cov(X, Y).


Exercise 5.4.9. Let X and Y be random variables such that Var(X) = 4, Var(Y) = 1 
and p(X, Y) = 1/2. Find Var(X — 3Y). 


Exercise 5.4.10. Suppose in Example 5.4.3 the first exam score of student E is
changed from 80 to 90.


1. Recompute p(X, Y) with this change. 
2. Find the equation of the new least squares line. 
3. Draw the scatter plot, together with the new line. 


5.5 Conditional Expectation 


In many applications, we need to consider expected values under given conditions. 
We define such expected values much as we defined unconditional ones; we just 
replace the unconditional distributions in the earlier definitions with conditional dis- 
tributions: 


Definition 5.5.1 (Conditional Expectation). Let A be any event with P(A) ≠ 0 and
X any discrete random variable. Then we define the conditional expectation of X
under the condition A by

\[ E_A(X) = \sum_{x:\, f_{X|A}(x) > 0} x f_{X|A}(x). \tag{5.159} \]

Let A be any event with P(A) ≠ 0 and X any continuous random variable such
that $f_{X|A}$ exists. Then we define the conditional expectation of X under the condition
A by

\[ E_A(X) = \int_{-\infty}^{\infty} x f_{X|A}(x)\,dx. \tag{5.160} \]

If X is discrete and Y any random variable such that $f_{X|Y}$ exists, then the
conditional expectation of X given Y = y is defined by

\[ E_y(X) = \sum_{x:\, f_{X|Y}(x, y) > 0} x f_{X|Y}(x, y). \tag{5.161} \]

If X is continuous and Y any random variable such that $f_{X|Y}$ exists, then the
conditional expectation of X given Y = y is defined by

\[ E_y(X) = \int_{-\infty}^{\infty} x f_{X|Y}(x, y)\,dx. \tag{5.162} \]

All the theorems for unconditional expectations remain valid for conditional ex- 
pectations as well, because the definitions are essentially the same, just that the 
unconditional f’s are replaced by conditional ones. The latter are still probability 
functions or densities and so this change does not affect the proofs. In particular, 
conditional expectations of functions g(X) can be computed for discrete X as 


\[ E_y(g(X)) = \sum_{x:\, f_{X|Y}(x, y) > 0} g(x) f_{X|Y}(x, y), \tag{5.163} \]

and for continuous X as

\[ E_y(g(X)) = \int_{-\infty}^{\infty} g(x) f_{X|Y}(x, y)\,dx. \tag{5.164} \]

Also,

\[ E_y(aX + b) = aE_y(X) + b \tag{5.165} \]

and

\[ E_y(X_1 + X_2) = E_y(X_1) + E_y(X_2). \tag{5.166} \]


Note that, whether X is discrete or continuous, $E_y(X)$ is a function of y, say
g(y). If we replace y here by the random variable Y, we get a new random variable
$g(Y) = E_Y(X)$. The next theorem says that the expected value of this new random
variable is E(X). In other words, we can obtain the expected value of X in two 
steps: first, averaging X under some given conditions, and then averaging over the 
conditions with the appropriate weights. This procedure is analogous to the one in the 
theorem of total probability, in which we computed the probability (rather than the 
average) of an event A under certain conditions and then averaged over the conditions 
with the appropriate weights. 


Theorem 5.5.1 (Theorem of Total Expectation). If all expectations below exist,
then

\[ E(E_Y(X)) = E(X). \tag{5.167} \]


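Proof. We give the proof for the continuous case only.
By Definition 5.5.1,

\[ E_y(X) = \int_{-\infty}^{\infty} x f_{X|Y}(x, y)\,dx = \int_{-\infty}^{\infty} x\,\frac{f(x, y)}{f_Y(y)}\,dx. \tag{5.168} \]

Also, by Theorem 5.1.3,

\[ E(E_Y(X)) = \int_{-\infty}^{\infty} E_y(X) f_Y(y)\,dy. \tag{5.169} \]

Thus,

\[ \begin{aligned}
E(E_Y(X)) &= \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} x\,\frac{f(x, y)}{f_Y(y)}\,dx \right) f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} x \left( \int_{-\infty}^{\infty} f(x, y)\,dy \right) dx \\
&= \int_{-\infty}^{\infty} x f_X(x)\,dx = E(X).
\end{aligned} \tag{5.170} \]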


Example 5.5.1 (Sum and Absolute Difference of Two Dice). In Table 4.8 we displayed
$f_{U|V}(u, v)$ for the random variables U = X + Y and V = |X − Y|, where X and Y
were the numbers obtained by rolling two dice. Hence, for v = 0 we get

\[ E_v(U) = 2\cdot\frac{1}{6} + 4\cdot\frac{1}{6} + 6\cdot\frac{1}{6} + 8\cdot\frac{1}{6} + 10\cdot\frac{1}{6} + 12\cdot\frac{1}{6} = 7. \tag{5.171} \]


Similarly, $E_v(U) = 7$ for all other values of v as well, and so, using the marginal
probabilities $f_V(v)$, we obtain

\[ E(E_V(U)) = 7\cdot\frac{6}{36} + 7\cdot\frac{10}{36} + 7\cdot\frac{8}{36} + 7\cdot\frac{6}{36} + 7\cdot\frac{4}{36} + 7\cdot\frac{2}{36} = 7. \tag{5.172} \]

This is indeed the same value that we would obtain directly from the marginal
probabilities $f_U(u)$ or from E(U) = E(X) + E(Y) = 2 · 3.5.
Going the other way, from Table 4.9, we have, for instance, for u = 4

\[ E_u(V) = 0\cdot\frac{1}{3} + 2\cdot\frac{2}{3} = \frac{4}{3}. \tag{5.173} \]


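The whole function $E_u(V)$ is given by Table 5.2.


Table 5.2.

u      | 2 | 3 | 4   | 5 | 6    | 7 | 8    | 9 | 10  | 11 | 12
E_u(V) | 0 | 1 | 4/3 | 2 | 12/5 | 3 | 12/5 | 2 | 4/3 | 1  | 0


Thus

\[ \begin{aligned}
E(V) = E(E_U(V)) &= 0\cdot\frac{1}{36} + 1\cdot\frac{2}{36} + \frac{4}{3}\cdot\frac{3}{36} + 2\cdot\frac{4}{36} + \frac{12}{5}\cdot\frac{5}{36} + 3\cdot\frac{6}{36} \\
&\quad + \frac{12}{5}\cdot\frac{5}{36} + 2\cdot\frac{4}{36} + \frac{4}{3}\cdot\frac{3}{36} + 1\cdot\frac{2}{36} + 0\cdot\frac{1}{36} = \frac{70}{36}.
\end{aligned} \tag{5.174} \]

As required by the theorem, the direct computation of E(V) from $f_V(v)$ gives the
same result:

\[ E(V) = 0\cdot\frac{6}{36} + 1\cdot\frac{10}{36} + 2\cdot\frac{8}{36} + 3\cdot\frac{6}{36} + 4\cdot\frac{4}{36} + 5\cdot\frac{2}{36} = \frac{70}{36}. \tag{5.175} \]
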
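
The two computations of E(V) in the example above can also be checked by brute-force enumeration of the 36 outcomes. The following is only a sketch; the names `outcomes`, `cond_exp`, and `f_U` are our own illustrative choices, and exact fractions are used so that the values can be compared with Table 5.2.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (x, y) of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

# Collect the values of V = |x - y| for each value of U = x + y.
v_given_u = defaultdict(list)
for x, y in outcomes:
    v_given_u[x + y].append(abs(x - y))

# E_u(V): given U = u, the outcomes in the group are equally likely.
cond_exp = {u: Fraction(sum(vs), len(vs)) for u, vs in v_given_u.items()}

# Marginal probabilities f_U(u) and the two ways of computing E(V).
f_U = {u: len(vs) * p for u, vs in v_given_u.items()}
total = sum(cond_exp[u] * f_U[u] for u in cond_exp)   # E(E_U(V))
direct = sum(abs(x - y) * p for x, y in outcomes)     # E(V) directly

for u in sorted(cond_exp):
    print(u, cond_exp[u])
print(total, direct)
```

Both `total` and `direct` come out as 35/18, which is 70/36 in lowest terms.
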
Example 5.5.2 (Conditional Expectation for (X, Y) Uniform on Unit Disc). Let
(X, Y) be uniform on the unit disc D = {(x, y) : x² + y² < 1} as in Example
4.6.2. Then

\[ E_y(X) = \int_{-\infty}^{\infty} x f_{X|Y}(x, y)\,dx = \int_{-\sqrt{1-y^2}}^{\sqrt{1-y^2}} \frac{x}{2\sqrt{1-y^2}}\,dx = 0 \quad \text{for } y \in (-1, 1), \tag{5.176} \]

just as we would expect by symmetry.
Let us modify the last example in order to avoid the trivial outcome: 


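Example 5.5.3 (Conditional Expectation for (X, Y) Uniform on Half Disc). Let
(X, Y) be uniform on the right-half disc D = {(x, y) : x² + y² < 1, 0 < x}.
Then

\[ E_y(X) = \int_{-\infty}^{\infty} x f_{X|Y}(x, y)\,dx = \int_{0}^{\sqrt{1-y^2}} \frac{x}{\sqrt{1-y^2}}\,dx = \frac{1}{2}\sqrt{1-y^2} \quad \text{for } y \in (-1, 1), \tag{5.177} \]

and

\[ \begin{aligned}
E(X) = E(E_Y(X)) &= \int_{-\infty}^{\infty} E_y(X) f_Y(y)\,dy \\
&= \int_{-1}^{1} \frac{1}{2}\sqrt{1-y^2}\cdot\frac{2}{\pi}\sqrt{1-y^2}\,dy \\
&= \frac{1}{\pi}\int_{-1}^{1} (1 - y^2)\,dy = \frac{4}{3\pi}.
\end{aligned} \tag{5.178} \]
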
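
As a numerical sanity check of the half-disc computation, we can integrate E_y(X) f_Y(y) over (−1, 1) with a simple midpoint rule and compare the result with 4/(3π). This is a sketch only: the helper `midpoint` and the number of subintervals are our own choices, and the marginal density f_Y(y) = (2/π)√(1 − y²) is the one that follows from the area π/2 of the half disc.

```python
import math

def midpoint(func, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

def cond_mean(y):
    """E_y(X) = sqrt(1 - y^2) / 2 for the right-half disc."""
    return 0.5 * math.sqrt(1.0 - y * y)

def f_Y(y):
    """Marginal density of Y: chord length sqrt(1 - y^2) times 2/pi."""
    return (2.0 / math.pi) * math.sqrt(1.0 - y * y)

approx = midpoint(lambda y: cond_mean(y) * f_Y(y), -1.0, 1.0)
exact = 4.0 / (3.0 * math.pi)
print(approx, exact)
```
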
Before we state the next theorem, we present a lemma:


Lemma 5.5.1. For any random variables X and Y and any functions g(X) and h(Y)
such that $E_Y(g(X))$ exists,

\[ E_Y(g(X)h(Y)) = h(Y)E_Y(g(X)). \tag{5.179} \]

Proof. We give the proof for the continuous case only.
By definition,

\[ \begin{aligned}
E_y(g(X)h(Y)) &= \int_{-\infty}^{\infty} g(x)h(y) f_{X|Y}(x, y)\,dx \\
&= h(y)\int_{-\infty}^{\infty} g(x) f_{X|Y}(x, y)\,dx = h(y)E_y(g(X)).
\end{aligned} \tag{5.180} \]

If we replace y by Y, we get the statement of the lemma. □


The next theorem answers the following question: Suppose that, for given random
variables X and Y, we want to find a function p(Y) that is as close as possible to
X. If we observe Y = y, then p(y) may be considered to be a prediction of the
corresponding value x of X. Thus we ask: What is the best prediction p(Y) of X, given
Y? “Best” is defined in the least squares sense, that is, in terms of minimizing the
expected value of the squared difference of X and p(Y). The answer is a generalization
of the result of Theorem 5.2.5, that the mean of squared deviations is minimum
if the deviations are taken from the mean, i.e., that E(X) is the best prediction in the
least squares sense for X. (For example, if we toss a coin a hundred times, the best
prediction for the number of heads is fifty. On the other hand, if we toss only once,
then E(X) = 1/2 is not much of a prediction but, still, that is the best we can do.)


Theorem 5.5.2 (Best Prediction of X, Given Y). For given random variables X
and Y and all functions p(Y), the mean squared difference $E([X - p(Y)]^2)$, if it
exists, is minimized by the function $p(Y) = E_Y(X)$.


Proof.

\[ \begin{aligned}
E([X - p(Y)]^2) &= E([X - E_Y(X) + E_Y(X) - p(Y)]^2) \\
&= E([X - E_Y(X)]^2) + E([E_Y(X) - p(Y)]^2) \\
&\quad + 2E[(X - E_Y(X))(E_Y(X) - p(Y))].
\end{aligned} \tag{5.181} \]

By Theorem 5.5.1, the last term can be reformulated as

\[ 2E[(X - E_Y(X))(E_Y(X) - p(Y))] = 2E(E_Y[(X - E_Y(X))(E_Y(X) - p(Y))]). \tag{5.182} \]

On the right-hand side, we can apply the lemma, with $X - E_Y(X) = g(X)$ and
$E_Y(X) - p(Y) = h(Y)$. Thus,

\[ \begin{aligned}
E_Y[(X - E_Y(X))(E_Y(X) - p(Y))] &= (E_Y(X) - p(Y))E_Y[X - E_Y(X)] \\
&= (E_Y(X) - p(Y))[E_Y(X) - E_Y(E_Y(X))] = 0,
\end{aligned} \tag{5.183} \]

since $E_Y(E_Y(X)) = E_Y(X)$. (The proof of this identity is left as Exercise 5.5.8.)
Hence,

\[ E([X - p(Y)]^2) = E([X - E_Y(X)]^2) + E([E_Y(X) - p(Y)]^2), \tag{5.184} \]

and, since the first term does not depend on p and the second term is nonnegative
and vanishes for $p(Y) = E_Y(X)$, the sum on the right-hand side is minimum if
$p(Y) = E_Y(X)$. □
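
The decomposition in Equation 5.184 can be verified exactly on the dice variables of Example 5.5.1, predicting V = |X − Y| from U = X + Y. The sketch below compares the conditional expectation with a competing predictor; the function `rival` is an arbitrary choice of ours, made only for illustration, and the helper names are likewise assumptions.

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely outcomes of two fair dice; U = x + y, V = |x - y|.
outcomes = list(product(range(1, 7), repeat=2))

def cond_exp(u):
    """E_u(V): the average of V over the outcomes with x + y = u."""
    vs = [abs(x - y) for x, y in outcomes if x + y == u]
    return Fraction(sum(vs), len(vs))

def mean_sq(diff):
    """E(diff(x, y)^2) under the uniform distribution on the outcomes."""
    return sum(Fraction(diff(x, y)) ** 2 for x, y in outcomes) / 36

def rival(u):
    """An arbitrary competing predictor p(u), used only for comparison."""
    return Fraction(6 - abs(u - 7), 2)

best = mean_sq(lambda x, y: abs(x - y) - cond_exp(x + y))   # E([V - E_U(V)]^2)
other = mean_sq(lambda x, y: abs(x - y) - rival(x + y))     # E([V - p(U)]^2)
gap = mean_sq(lambda x, y: cond_exp(x + y) - rival(x + y))  # E([E_U(V) - p(U)]^2)

print(best, other, gap)
```

Here `other` equals `best + gap` exactly, as Equation 5.184 requires, so the competing predictor can never do better than the conditional expectation.
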


The notion of conditional expectation can be used to define conditional variance: 


Definition 5.5.2 (Conditional Variance). For given random variables X and Y, the
conditional variance $\mathrm{Var}_y(X)$ is defined as

\[ \mathrm{Var}_y(X) = E_y([X - E_y(X)]^2). \tag{5.185} \]

Clearly, $\mathrm{Var}_y(X)$ is a function of y and so $\mathrm{Var}_Y(X)$ is a function of the random
variable Y and, as such, it is another random variable. However, the theorem of total 
expectation does not extend to conditional variances. (We leave the explanation as 
Exercise 5.5.13.) 


Exercises 


Exercise 5.5.1. Prove Theorem 5.5.1 for discrete X and Y. 


Exercise 5.5.2. Roll two dice as in Example 5.5.1. Let U = max(X, Y) and V =
min(X, Y). Find $E_v(U)$ and $E_u(V)$ for each possible value of v and u, and verify
the relations $E(E_V(U)) = E(U)$ and $E(E_U(V)) = E(V)$.


Exercise 5.5.3. Define a random variable X as follows: Toss a coin and if we get H,
then let X be uniform on the interval [0, 2], and if we get T, then throw a die and let
X be the number obtained. Find E(X).


Exercise 5.5.4. Suppose a plant has X offspring in a year with P(X = x) = 1/4
for x = 1, 2, 3, 4 and, independently, each offspring has from one to four offspring
in the next year with the same discrete uniform distribution. Let Y denote the total
number of offspring in the second generation. Find the values of $E_x(Y)$ and compute
$E(E_X(Y))$.


Exercise 5.5.5. Let (X, Y) be uniform on the triangle D = {(x, y) : 0 < x, 0 <
y, x + y < 1}. Compute $E_x(Y)$, $E_y(X)$, E(X) and E(Y).


Exercise 5.5.6. Let (X, Y) be uniform on the triangle D = {(x, y) : 0 < x < y <
1}. Compute $E_x(Y)$, $E_y(X)$, E(X) and E(Y).


Exercise 5.5.7. Let (X, Y) be uniform on the open unit square D = {(x, y) : 0 <
x < 1, 0 < y < 1} and Z = X + Y as in Exercise 4.6.7. Find $E_z(X)$ and $E_x(Z)$.


Exercise 5.5.8. Prove that $E_Y(E_Y(X)) = E_Y(X)$ if $E_Y(X)$ exists.


Exercise 5.5.9. Let X and Y be continuous random variables with joint density
f(x, y) and let g(x, y) be any integrable function. Prove that $E(E_Y(g(X, Y))) =
E(g(X, Y))$ if $E(E_Y(g(X, Y)))$ exists.


Exercise 5.5.10. Show that for arbitrary X and Y with nonzero variances, if $E_y(X) =
c$ for all y, where c is a constant, then X and Y are uncorrelated.


Exercise 5.5.11. Show that for continuous X and Y, if $E_y(X) = c$ for all y, where c
is a constant, then E(X) = c and $\mathrm{Var}(X) = E(\mathrm{Var}_Y(X))$ if all quantities exist.


Exercise 5.5.12. Let X and Y be as in Exercise 5.5.4. Find the values of $\mathrm{Var}_x(Y)$
and compute $E(\mathrm{Var}_X(Y))$ and Var(Y).


Exercise 5.5.13. Explain why $\mathrm{Var}(X) \ne E(\mathrm{Var}_Y(X))$ in general.


Exercise 5.5.14. Show that for continuous X and Y, $\mathrm{Var}(X) = E(\mathrm{Var}_Y(X)) +
\mathrm{Var}(E_Y(X))$ if all quantities exist.


5.6 Median and Quantiles 


The expected value of a random variable was introduced to provide a numerical 
value for the center of its distribution. For some random variables, however, it is 
preferable to use another quantity for this purpose, either because E(X) does not 
exist or because the distribution of X is very skewed and E(X) does not represent 
the center very well. The latter case occurs, for instance, when X stands for the 
income of a randomly selected person from a set of ten people, with nine earning 
twenty thousand dollars and one of them earning twenty million dollars. Saying that
the average income is E(X) = (1/10)(9 · 20,000 + 20,000,000) ≈ 2,000,000
dollars is worthless and misleading. In such cases we use the median to represent the
center. Also, for some random variables E(X) does not exist, but a median always 
does. 

We want to define the median so that half of the probability is below it and half 
above it. This aim, however, cannot always be achieved, and even if it can, the median 
may not be unique, as will be seen below. Thus, we relax the requirements somewhat 
and make the following definition: 


Definition 5.6.1 (Median). For any random variable X, a median of X, or of its
distribution, is a number m such that P(X < m) ≤ 1/2 and P(X > m) ≤ 1/2.


Note that P(X < m) or P(X > m) can be less than 1/2 only if P(X = m) ≠ 0
or, in other words, we have the following theorem:
Theorem 5.6.1 (A Condition for P(X < m) = 1/2). For m a median of a random
variable X, P(X = m) = 0 implies that P(X < m) = 1/2 and P(X > m) = 1/2.


Proof. Since m is a median,

\[ P(X < m) \le \frac{1}{2} \tag{5.186} \]

and

\[ P(X > m) \le \frac{1}{2}. \tag{5.187} \]

Also, since P(X = m) = 0, we have

\[ P(X < m) + P(X > m) = 1. \tag{5.188} \]

Now, if we had P(X < m) < 1/2, then adding corresponding sides of this inequality
and inequality 5.187 we would get P(X < m) + P(X > m) < 1, in contradiction to
Equation 5.188. Thus, we must have P(X < m) = 1/2 and then also P(X > m) =
1/2. □


Observe that for continuous random variables the condition P(X = m) = 0
always holds, and therefore so does the conclusion of Theorem 5.6.1.

Before considering specific examples, we show that for the large class of sym- 
metric distributions the center of symmetry is a median as well as E(X) (see Theo- 
rem 5.1.1). 


Theorem 5.6.2 (The Center of Symmetry is a Median). If the distribution of a
random variable X is symmetric about a point a, that is, the p.f. or the p.d.f. satisfies
f(a − x) = f(a + x) for all x, then a is a median of X.


Proof. We give the proof for continuous X only; for discrete X the proof is similar
and is left as an exercise.

If the density of X satisfies f(a − x) = f(a + x) for all x, then, by obvious
changes of variables,

\[ \begin{aligned}
P(X < a) &= \int_{-\infty}^{a} f(t)\,dt = -\int_{\infty}^{0} f(a - x)\,dx \\
&= \int_{0}^{\infty} f(a - x)\,dx = \int_{0}^{\infty} f(a + x)\,dx \\
&= \int_{a}^{\infty} f(u)\,du = P(X > a).
\end{aligned} \tag{5.189} \]

Since, for continuous X, also

\[ P(X < a) + P(X > a) = 1, \tag{5.190} \]

we obtain

\[ P(X < a) = P(X > a) = \frac{1}{2}, \tag{5.191} \]

which shows that a is a median of X. □
Example 5.6.1 (Median of Uniform Distributions). If X is uniform on the interval
[a, b], then, by Theorem 5.6.2, the center m = (a + b)/2 is a median. Furthermore,
this median is unique, because, if c < m is another point, then P(X > c) > 1/2 and,
if c > m, then P(X < c) > 1/2. Thus, c is not a median in either case according to
Definition 5.6.1.


The next example shows that even if the distribution is symmetric, the median 
does not need to be unique. 


Example 5.6.2 (Median of a Distribution Uniform on Two Intervals). Let X be
uniform on the union [0, 1] ∪ [2, 3] of two intervals, that is, let

\[ f(x) = \begin{cases} 1/2 & \text{if } 0 \le x \le 1 \\ 1/2 & \text{if } 2 \le x \le 3 \\ 0 & \text{otherwise.} \end{cases} \]

Then f(x) is symmetric about a = 3/2, and so, by Theorem 5.6.2, 3/2 is a median,
but, clearly, any m in [1, 2] is also a median.


In the next example, P(X < m) ≠ 1/2.


Example 5.6.3 (Median of a Binomial). Let X be binomial with parameters n = 4
and p = 1/2. Then, by symmetry, m = 2 is a median. But, since P(X = 2) =
$\binom{4}{2}(1/2)^4 = 3/8$, we have P(X < 2) = P(X > 2) = (1/2)(1 − (3/8)) = 5/16.
For any m ≠ 2, one of P(X < m) and P(X > m) is at least 5/16 + 3/8 = 11/16 > 1/2,
and so 2 is the only median.
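
Definition 5.6.1 translates directly into a test that can be applied to any finite discrete distribution. The sketch below applies it to the binomial distribution of the example above; the function name `is_median` and the grid of candidate points (integers and half-integers from 0 to 4) are our own illustrative choices.

```python
from fractions import Fraction
from math import comb

def is_median(pf, m):
    """Check Definition 5.6.1: P(X < m) <= 1/2 and P(X > m) <= 1/2."""
    below = sum(p for x, p in pf.items() if x < m)
    above = sum(p for x, p in pf.items() if x > m)
    return below <= Fraction(1, 2) and above <= Fraction(1, 2)

# Probability function of a binomial with n = 4 and p = 1/2.
binom4 = {k: Fraction(comb(4, k), 16) for k in range(5)}

# Candidate medians: 0, 1/2, 1, ..., 4.
candidates = [Fraction(k, 2) for k in range(9)]
medians = [m for m in candidates if is_median(binom4, m)]
print(medians)
```

Among these candidates only m = 2 passes the test, in agreement with the example.
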


Example 5.6.4 (Median of the Exponential Distribution). Let T be exponential with
parameter λ. Then P(T ≤ t) = F(t) is continuous and strictly increasing on (0, ∞),
and so we can solve F(m) = 1/2, that is, by Definition 4.2.3, solve

\[ 1 - e^{-\lambda m} = \frac{1}{2}. \tag{5.192} \]

Hence,

\[ m = \frac{\ln 2}{\lambda} \tag{5.193} \]

is the unique median.

In physics, such a T is used to represent the lifetime of a radioactive particle. In 
that case, m is called (somewhat misleadingly) the half-life of the particle, since it is 
the length of time in which the particle decays with probability 1/2 or, equivalently, 
the length of time in which half of a very large number of such particles decay. 


An interesting property of medians is that they minimize the “mean absolute 
deviations” just as the expected value minimizes mean squared deviations (Exercise 
5.2.5 and Theorem 5.5.2): 
Theorem 5.6.3 (Medians Minimize Mean Absolute Deviations). For any random
variable X such that the expected values below exist,

\[ \min_c E(|X - c|) = E(|X - m|) \tag{5.194} \]

for any median m of X.


Proof. We give the proof only for continuous X with density f(x).
Let m be any median of X and c any number such that c > m. (For c < m the
proof would require minor modifications.) Then

\[ \begin{aligned}
E(|X - c|) - E(|X - m|) &= \int_{-\infty}^{\infty} (|x - c| - |x - m|) f(x)\,dx \\
&= \int_{-\infty}^{m} ((c - x) - (m - x)) f(x)\,dx + \int_{m}^{c} ((c - x) - (x - m)) f(x)\,dx \\
&\quad + \int_{c}^{\infty} ((x - c) - (x - m)) f(x)\,dx \\
&= \int_{-\infty}^{m} (c - m) f(x)\,dx + \int_{m}^{c} (c + m - 2x) f(x)\,dx \\
&\quad + \int_{c}^{\infty} (m - c) f(x)\,dx.
\end{aligned} \tag{5.195} \]

Now, between m and c we have 2x ≤ 2c and −2x ≥ −2c. Adding c + m to both
sides, we get c + m − 2x ≥ c + m − 2c = m − c. Thus,

\[ \begin{aligned}
E(|X - c|) - E(|X - m|) &\ge \int_{-\infty}^{m} (c - m) f(x)\,dx + \int_{m}^{c} (m - c) f(x)\,dx + \int_{c}^{\infty} (m - c) f(x)\,dx \\
&= \int_{-\infty}^{m} (c - m) f(x)\,dx + \int_{m}^{\infty} (m - c) f(x)\,dx \\
&= (c - m)[P(X < m) - P(X > m)] = (c - m)\left(\frac{1}{2} - \frac{1}{2}\right) = 0.
\end{aligned} \tag{5.196} \]

Hence

\[ E(|X - c|) \ge E(|X - m|) \tag{5.197} \]

for any c, which shows that the minimum of E(|X − c|) occurs for c = m. □
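
Although the proof above treats only continuous X, the theorem is stated for any random variable, and it can be illustrated with the income distribution from the beginning of this section (nine people earning 20,000 dollars and one earning 20,000,000). The sketch below compares the mean absolute deviation taken from the median with that taken from the mean; the function name `mean_abs_dev` and the grid of trial values are our own choices.

```python
# Incomes of the ten people: X is the income of a randomly selected one.
incomes = [20_000] * 9 + [20_000_000]

def mean_abs_dev(c):
    """E(|X - c|) for X uniform on the list of incomes."""
    return sum(abs(x - c) for x in incomes) / len(incomes)

median = 20_000                       # P(X < m) = 0 and P(X > m) = 1/10
mean = sum(incomes) / len(incomes)    # E(X) = 2,018,000

print(mean_abs_dev(median))   # 1998000.0
print(mean_abs_dev(mean))     # 3596400.0

# No trial value c on a coarse grid does better than the median.
grid_ok = all(mean_abs_dev(c) >= mean_abs_dev(median)
              for c in range(0, 40_000_001, 500_000))
print(grid_ok)
```
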


A useful generalization of the notion of a median is obtained by prescribing an
arbitrary number p ∈ (0, 1) and asking for a number $x_p$ such that $F(x_p) = P(X \le
x_p) = p$, instead of 1/2. Unfortunately, for some distributions and certain values
of p this equation cannot be solved or the solution is not unique, and for those the
definition below is somewhat more complicated.
Definition 5.6.2 (Quantiles). Let X be a continuous random variable with F(x)
continuous and strictly increasing from 0 to 1 on some finite or infinite interval I. Then,
for any p ∈ (0, 1), the solution $x_p$ of $F(x_p) = p$ or, in other words, $x_p = F^{-1}(p)$
is called the p quantile or the 100p percentile and the function $F^{-1}$ the quantile
function of X or of the distribution of X. For general X the p quantile is defined as
$x_p = \min\{x : F(x) \ge p\}$ and we define the quantile function $F^{-1}$ by $F^{-1}(p) = x_p$
for all p ∈ (0, 1).


Quantiles or percentiles are often used to describe statistical data such as exam 
scores, home prices, incomes, etc. For example, a student’s score of, say, 650 on the 
math SAT is much better understood if it is also stated that this number is at the 78th 
percentile, meaning that 78% of the students who took the test scored 650 or less, 
or in other words, a randomly selected student’s score is 650 or less with probability 
0.78. Also, some distributions in statistics, as will be seen later, are usually described 
in terms of their quantile function $F^{-1}$ rather than in terms of F or f.

Clearly, the 50th percentile is also a median. Furthermore, the 25th percentile is 
also called the first quartile, the 50th percentile the second quartile, and the 75th 
percentile the third quartile. 


Example 5.6.5 (Quantiles of the Uniform Distribution). If X is uniform on the
interval [a, b], then

\[ F(x) = \begin{cases} 0 & \text{if } x < a \\ \dfrac{x - a}{b - a} & \text{if } a \le x < b \\ 1 & \text{if } x \ge b \end{cases} \tag{5.198} \]

is continuous and strictly increasing from 0 to 1 on (a, b), and so we can solve
$F(x_p) = p$ for any p ∈ (0, 1), i.e., solve

\[ \frac{x_p - a}{b - a} = p. \tag{5.199} \]

Hence,

\[ x_p = a + p(b - a) \tag{5.200} \]

is the p quantile for any p ∈ (0, 1).


Example 5.6.6 (Quantiles of the Exponential Distribution). Let T be exponential
with parameter λ. Then P(T ≤ t) = F(t) is continuous and strictly increasing
from 0 to 1 on (0, ∞), and so we can solve $F(x_p) = p$ for any p ∈ (0, 1), i.e., solve

\[ 1 - e^{-\lambda x_p} = p. \tag{5.201} \]

Hence,

\[ x_p = -\frac{\ln(1 - p)}{\lambda} \tag{5.202} \]

is the p quantile for any p ∈ (0, 1).

Example 5.6.7 (Quantiles of a Binomial). Let X be binomial with parameters n = 3
and p = 1/2. Then

\[ F(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1/8 & \text{if } 0 \le x < 1 \\ 1/2 & \text{if } 1 \le x < 2 \\ 7/8 & \text{if } 2 \le x < 3 \\ 1 & \text{if } x \ge 3. \end{cases} \tag{5.203} \]

In this case we must use the formula $x_p = \min\{x : F(x) \ge p\}$ to find the quantiles.
For example, if p = 1/4, then the 1/4 quantile $x_{0.25}$ is the lowest x-value such that
F(x) ≥ 1/4. As seen from Equation 5.203, $x_{0.25} = 1$, since F(1) = 1/2, and for
x < 1 we have F(x) = 0 or 1/8. So x = 1 is the lowest value where F(x) jumps
above 1/4. Similarly, $x_p = 1$ for any p ∈ (1/8, 1/2], and computing the $x_p$ values
for all p ∈ (0, 1], we obtain

\[ F^{-1}(p) = x_p = \begin{cases} 0 & \text{if } 0 < p \le 1/8 \\ 1 & \text{if } 1/8 < p \le 1/2 \\ 2 & \text{if } 1/2 < p \le 7/8 \\ 3 & \text{if } 7/8 < p \le 1. \end{cases} \tag{5.204} \]
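
The rule x_p = min{x : F(x) ≥ p} used in the example above amounts to accumulating the probabilities until the running total first reaches p. A minimal sketch for a binomial distribution follows; the function name `binomial_quantile` and its argument names are our own, and exact fractions are used so that boundary cases such as p = 1/8 are decided without rounding error.

```python
from fractions import Fraction
from math import comb

def binomial_quantile(n, prob, p):
    """Smallest x in {0, ..., n} with F(x) >= p, for 0 < p <= 1."""
    cdf = Fraction(0)
    for x in range(n + 1):
        # Add P(X = x) to the running value of F.
        cdf += comb(n, x) * prob**x * (1 - prob) ** (n - x)
        if cdf >= p:
            return x
    raise ValueError("p must lie in (0, 1]")

half = Fraction(1, 2)
for p in (Fraction(1, 8), Fraction(1, 4), half, Fraction(7, 8), Fraction(9, 10)):
    print(p, binomial_quantile(3, half, p))
```

For n = 3 and probability 1/2 the printed quantiles are 0, 1, 1, 2, 3, matching the four cases of the quantile function found in the example.
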


Exercises 


Exercise 5.6.1. Find all medians of the discrete uniform X on the set of increasingly
numbered values $x_1, x_2, \ldots, x_n$


1. for odd n,
2. for even n.


Exercise 5.6.2. Prove Theorem 5.6.2 for discrete X. 
Exercise 5.6.3. Is the converse of Theorem 5.6.1 true? Prove your answer. 


Exercise 5.6.4. Prove that, for any X, a number m is a median if and only if P(X ≥
m) ≥ 1/2 and P(X ≤ m) ≥ 1/2.


Exercise 5.6.5. Prove by differentiation that, for continuous X with continuous
density f(x) > 0 such that the expected values below exist, with m the median and c
not a median, E(|X − c|) > E(|X − m|), that is, $\min_c E(|X - c|)$ occurs only at the
median.


Exercise 5.6.6. Let X be uniform on the interval (0, 1). Find the median of 1/X. 


Exercise 5.6.7. Prove that for any X the 50th percentile is a median. 
Exercise 5.6.8. Find the quartiles of the first and second grades X and Y of a
randomly selected student in Example 5.4.3.


Exercise 5.6.9. Find and plot the quantile function for an X with density 


a oe 1 
f@=} 2 ' as (5.205) 


0 otherwise. 


Exercise 5.6.10. Find and plot the quantile function for an X uniform on the union 
[0, 1] U [2, 3] as in Example 5.6.2. 


Exercise 5.6.11. Find and plot the quantile function for the X of Example 4.2.3. 


Exercise 5.6.12. Find and plot the quantile function for a binomial X with n = 4 
and p = 0.3. 


6 


Some Special Distributions 


6.1 Poisson Random Variables 


Poisson random variables¹ are used to model the number of occurrences of certain
events that come from a large number of independent sources, such as the number 
of calls to an office telephone during business hours, the number of atoms decaying 
in a sample of some radioactive substance, the number of visits to a web site, or the 
number of customers entering a store. 


Definition 6.1.1 (Poisson Distribution). A random variable X is Poisson with
parameter λ > 0, if it is discrete with p.f. given by

\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{for } k = 0, 1, \ldots. \tag{6.1} \]

The distribution of such an X is called the Poisson distribution with parameter λ.


We can easily check that the probabilities in Equation 6.1 form a distribution:

\[ \sum_{k=0}^{\infty} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda}\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda}e^{\lambda} = 1. \tag{6.2} \]


The histogram of a typical Poisson p.f. is shown in Figure 6.1.

Fig. 6.1. Poisson p.f. for λ = 4.

¹ Named after their discoverer Siméon D. Poisson (1781–1840), who in 1837 first introduced
them to model the votes of jurors.

Now, where does Formula 6.1 come from? It arises as the limit of the binomial
distribution as n → ∞, while λ = np is kept constant, as will be shown below.
This fact is the reason why the Poisson distribution is a good model for the kind of
phenomena mentioned above. For instance, the number of people who may call an
office, say between 1 and 2 PM, is generally a very large number n, but each person
calls with only a very small probability p. If we assume that the calls are independent
of each other, the probability that there will be k calls is then given by the binomial
distribution. However, we generally do not know n and p, but we can establish the
mean number np of calls by observing the phone over several days. (We assume
that there is no change in the calling habits of the customers.) Now, when n is large
(>100), p is small (<0.01), and λ = np is known, then the binomial probabilities
will be very close to their limit as n → ∞, the Poisson distribution. So, here is the
theorem:


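Theorem 6.1.1 (The Poisson Distribution as the Limit of the Binomial). If n → ∞
and p → 0 such that np = λ is constant, then

\[ \binom{n}{k} p^k (1 - p)^{n-k} \to \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{for } k = 0, 1, \ldots. \tag{6.3} \]

Proof.

\[ \begin{aligned}
\binom{n}{k} p^k (1 - p)^{n-k} &= \frac{n(n-1)\cdots(n-k+1)}{k!}\, p^k (1 - p)^{n-k} \\
&= \frac{n(n-1)\cdots(n-k+1)}{k!\, n^k}\, n^k p^k (1 - p)^{n-k} \\
&= \frac{1}{k!}\cdot\frac{n}{n}\cdot\frac{n-1}{n}\cdots\frac{n-k+1}{n}\,(np)^k \left(1 - \frac{np}{n}\right)^{n-k} \\
&= \frac{1}{k!}\left(1 - \frac{1}{n}\right)\cdots\left(1 - \frac{k-1}{n}\right)\lambda^k \left(1 - \frac{\lambda}{n}\right)^{n}\left(1 - \frac{\lambda}{n}\right)^{-k} \\
&\to \frac{1}{k!}\,\lambda^k e^{-\lambda} = \frac{\lambda^k e^{-\lambda}}{k!}.
\end{aligned} \tag{6.4} \]

□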


Since np is the expected value of the binomial distribution, we expect λ to be
the expected value of the Poisson distribution. Similarly, since the variance of the
binomial distribution is npq = np(1 − p) = np − np² = λ − λp, and p → 0 in
the proof above, we expect λ to equal the variance of the Poisson distribution, also.
Indeed:
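Theorem 6.1.2 (Expectation and Variance of the Poisson Distribution). If X is
Poisson with parameter λ, then

\[ E(X) = \mathrm{Var}(X) = \lambda. \tag{6.5} \]

Proof.

\[ E(X) = \sum_{k=0}^{\infty} k\,\frac{\lambda^k e^{-\lambda}}{k!} = \lambda\sum_{k=1}^{\infty} \frac{\lambda^{k-1} e^{-\lambda}}{(k-1)!}. \tag{6.6} \]

If we change from the variable k to i = k − 1, then the expression on the right
becomes

\[ E(X) = \lambda\sum_{i=0}^{\infty} \frac{\lambda^{i} e^{-\lambda}}{i!} = \lambda\cdot 1 = \lambda. \tag{6.7} \]

To obtain the variance, we first compute E(X(X − 1)):

\[ E(X(X - 1)) = \sum_{k=0}^{\infty} k(k-1)\,\frac{\lambda^k e^{-\lambda}}{k!} = \lambda^2\sum_{k=2}^{\infty} \frac{\lambda^{k-2} e^{-\lambda}}{(k-2)!} = \lambda^2\sum_{i=0}^{\infty} \frac{\lambda^{i} e^{-\lambda}}{i!} = \lambda^2. \tag{6.8} \]

Hence

\[ E(X(X - 1)) = E(X^2) - E(X) = E(X^2) - \lambda = \lambda^2, \tag{6.9} \]

and so

\[ E(X^2) = \lambda^2 + \lambda. \tag{6.10} \]

Thus

\[ \mathrm{Var}(X) = E(X^2) - (E(X))^2 = \lambda^2 + \lambda - \lambda^2 = \lambda. \tag{6.11} \]

□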


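Theorem 6.1.3 (Moment Generating Function of the Poisson Distribution). If X
is Poisson with parameter λ, then

\[ \psi(t) = \exp\{\lambda(e^t - 1)\}. \tag{6.12} \]

Proof.

\[ \psi(t) = E(e^{tX}) = \sum_{k=0}^{\infty} e^{tk}\,\frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda}\sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!} = \exp\{\lambda(e^t - 1)\}. \tag{6.13} \]
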
Example 6.1.1 (Misprints on a Page). Suppose a page of a book contains n = 1000
characters, each of which is misprinted, independently of the others, with probability
$p = 10^{-4}$. Find the probabilities of having (a) no misprint, (b) exactly one, and
(c) at least one misprint on the page, both exactly, by the binomial formula, and
approximately, by the Poisson formula.

Let X denote the number of misprints. Then

(a) by the binomial formula,

\[ P(X = 0) = \binom{1000}{0}(10^{-4})^0(1 - 10^{-4})^{1000} \approx 0.904833, \tag{6.14} \]

and by the Poisson approximation with $\lambda = 1000 \cdot 10^{-4} = 0.1$,

\[ P(X = 0) \approx \frac{0.1^0 e^{-0.1}}{0!} \approx 0.904837. \tag{6.15} \]

(b) By the binomial formula,

\[ P(X = 1) = \binom{1000}{1}(10^{-4})^1(1 - 10^{-4})^{999} \approx 0.0904923, \tag{6.16} \]

and by the Poisson approximation,

\[ P(X = 1) \approx \frac{0.1^1 e^{-0.1}}{1!} \approx 0.0904837. \tag{6.17} \]

(c) By the binomial formula,

\[ P(X \ge 1) = 1 - P(X = 0) \approx 1 - 0.904833 = 0.095167, \tag{6.18} \]

and by the Poisson approximation,

\[ P(X \ge 1) = 1 - P(X = 0) \approx 1 - 0.904837 = 0.095163. \tag{6.19} \]


While the above approximations are interesting, they are not really necessary. Ac- 
tually, even the binomial model is only an approximation, because misprints some- 
times occur in clumps and may not be quite independent and their probabilities may 
vary. Also, not all pages have exactly 1000 characters and it is difficult to measure 
the probability of a character being misprinted, but relatively easy to establish the 
mean number of misprints per page. If we do not know n and p separately, but only 
the mean A = np, then we cannot use the binomial distribution, but the Poisson 
distribution is still applicable. 
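
The numbers in the example above are easy to reproduce. The following sketch evaluates the binomial p.f. and its Poisson approximation side by side; the function names `binomial_pf` and `poisson_pf` are our own, and ordinary floating-point arithmetic is assumed to be accurate enough at the six-digit level shown in the example.

```python
from math import comb, exp, factorial

def binomial_pf(k, n, p):
    """P(X = k) for X binomial with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pf(k, lam):
    """P(X = k) for X Poisson with parameter lam."""
    return lam**k * exp(-lam) / factorial(k)

n, p = 1000, 1e-4
lam = n * p   # the mean number of misprints per page, 0.1

for k in (0, 1):
    print(k, binomial_pf(k, n, p), poisson_pf(k, lam))

# At least one misprint, by both formulas.
print(1 - binomial_pf(0, n, p), 1 - poisson_pf(0, lam))
```
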


Example 6.1.2 (Diners at a Restaurant). Suppose that a restaurant has on the average 
50 diners per night. What is the probability that on a certain night 40 or fewer will 
show up? 

Suppose that the diners come from a large pool of potential customers, who show 
up independently with the same small probability for each. Then their number X may 
be taken to be Poisson, and so 
\[ P(X \le 40) = \sum_{k=0}^{40} \frac{50^k e^{-50}}{k!} \approx 0.086. \tag{6.20} \]

On the other hand, if we assume that the customers come in independent pairs
rather than individually, and denote the number of pairs by Y, then the corresponding
probability is

\[ P(Y \le 20) = \sum_{k=0}^{20} \frac{25^k e^{-25}}{k!} \approx 0.185. \tag{6.21} \]

These numbers show that, in order to estimate the probability of a slow night, it
is not enough to know how many people show up on average, but we need to know
the sizes of the groups that decide, independently from one another, whether to come
or not.
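
The two sums in the example above are tedious by hand but immediate by machine. The sketch below evaluates them term by term; the function name `poisson_cdf` is our own, and the straightforward summation is assumed to be adequate for parameters of this size.

```python
from math import exp, factorial

def poisson_cdf(m, lam):
    """P(X <= m) for X Poisson with parameter lam, summed term by term."""
    return sum(lam**k * exp(-lam) / factorial(k) for k in range(m + 1))

individuals = poisson_cdf(40, 50)   # diners arriving one by one
pairs = poisson_cdf(20, 25)         # diners arriving in independent pairs
print(individuals, pairs)
```

The second probability is roughly twice the first, although the mean number of diners is 50 in both models.
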


An important property of Poisson r.v.’s is contained in the following theorem: 


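Theorem 6.1.4 (The Sum of Independent Poisson Variables is Poisson). If $X_1$
and $X_2$ are independent Poisson r.v.'s with parameters $\lambda_1$ and $\lambda_2$, respectively, then
$X_1 + X_2$ is Poisson with parameter $\lambda_1 + \lambda_2$.


Proof. The joint distribution of $X_1$ and $X_2$ is given by

\[ p_{ik} = P(X_1 = i, X_2 = k) = \frac{\lambda_1^i \lambda_2^k e^{-(\lambda_1 + \lambda_2)}}{i!\,k!} \quad \text{for } i, k = 0, 1, \ldots, \tag{6.22} \]

and so

\[ \begin{aligned}
P(X_1 + X_2 = n) &= \sum_{i=0}^{n} P(X_1 = i, X_2 = n - i) \\
&= \sum_{i=0}^{n} \frac{\lambda_1^i \lambda_2^{n-i} e^{-(\lambda_1 + \lambda_2)}}{i!\,(n-i)!} = \frac{e^{-(\lambda_1 + \lambda_2)}}{n!}\sum_{i=0}^{n} \frac{n!}{i!\,(n-i)!}\,\lambda_1^i \lambda_2^{n-i} \\
&= \frac{e^{-(\lambda_1 + \lambda_2)}}{n!}\,(\lambda_1 + \lambda_2)^n \quad \text{for } n = 0, 1, \ldots.
\end{aligned} \]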
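
The convolution sum in the proof above can be checked numerically for particular parameter values. In the sketch below the values 0.7 and 1.8 are arbitrary choices of ours, and the function names are likewise illustrative.

```python
from math import exp, factorial

def poisson_pf(k, lam):
    """P(X = k) for X Poisson with parameter lam."""
    return lam**k * exp(-lam) / factorial(k)

def sum_pf(n, lam1, lam2):
    """P(X1 + X2 = n) for independent Poisson X1, X2, by convolution."""
    return sum(poisson_pf(i, lam1) * poisson_pf(n - i, lam2)
               for i in range(n + 1))

lam1, lam2 = 0.7, 1.8
# Largest discrepancy between the convolution and the Poisson(lam1 + lam2) p.f.
worst = max(abs(sum_pf(n, lam1, lam2) - poisson_pf(n, lam1 + lam2))
            for n in range(15))
print(worst)
```

The discrepancy is at the level of floating-point rounding error, as the theorem predicts.
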


In most applications, we are interested not just in one Poisson r.v. but in a whole 
family of Poisson r.v.’s. For instance, in the previous examples we may ask for the 
probabilities of the number of misprints on several pages and for the probabilities of 
the number of diners in a week or a month. 

In general, a family of random variables X(t) depending on a parameter t is
called a stochastic or random process. The parameter t is time in most applications,
but not always; in the generalization of Example 6.1.1, for instance, it would stand
for the number of pages. Here we are concerned with the particular stochastic process called the
Poisson process: 
Definition 6.1.2 (Poisson Process). A family of random variables X(t) depending
on a parameter t is called a Poisson process with rate λ, for any λ > 0, if X(t),
the number of occurrences of some kind in any interval of length t, has a Poisson
distribution with parameter λt for any t > 0, that is,

\[ P(X(t) = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!} \quad \text{for any } t > 0 \text{ and } k = 0, 1, \ldots, \tag{6.23} \]

and the numbers of occurrences in nonoverlapping time intervals are independent of
each other.


Example 6.1.3 (Misprints on Several Pages). Suppose the pages of a book contain
misprinted characters, independently of each other, with a rate of λ = 0.1 misprints
per page. Assume that the numbers X(t) of misprints on any t pages constitute a
Poisson process. Find the probabilities of having (a) no misprint on the first three
pages, (b) at least two misprints on the first two pages, and (c) at least two misprints
on the first two pages, if we know that there is at least one misprint on the first page.

(a) In this case t = 3 and λt = 0.3. Thus,

\[ P(X(3) = 0) = \frac{0.3^0 e^{-0.3}}{0!} \approx 0.74. \tag{6.24} \]

(b) Now t = 2 and λt = 0.2, and so

\[ \begin{aligned}
P(X(2) \ge 2) &= 1 - [P(X(2) = 0) + P(X(2) = 1)] \\
&= 1 - \left(\frac{0.2^0 e^{-0.2}}{0!} + \frac{0.2^1 e^{-0.2}}{1!}\right) \approx 0.0175.
\end{aligned} \tag{6.25} \]

(c) Let $X_1$ denote the number of misprints on the first page and $X_2$ the number
of misprints on the second page. Then $X_1$ and $X_2$ are independent and both Poisson
with parameter 0.1, and $X(2) = X_1 + X_2$. Hence,

\[ \begin{aligned}
P(X(2) \ge 2 \mid X_1 \ge 1) &= \frac{P(X_1 \ge 1, X(2) \ge 2)}{P(X_1 \ge 1)} \\
&= \frac{P(X_1 \ge 2) + P(X_1 = 1, X_2 \ge 1)}{P(X_1 \ge 1)} \\
&= \frac{[1 - e^{-0.1}(1 + 0.1)] + 0.1e^{-0.1}[1 - e^{-0.1}]}{1 - e^{-0.1}} \approx 0.140.
\end{aligned} \tag{6.26} \]

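
The three answers of the example above can be reproduced with a few lines of arithmetic. The sketch below also computes part (c) in a second way, from P(X₁ ≥ 1, X(2) ≥ 2) = P(X₁ ≥ 1) − P(X₁ = 1, X₂ = 0), as a cross-check; this alternative route and the variable names are our own.

```python
from math import exp

lam = 0.1                      # misprints per page

# (a) No misprint on the first three pages.
p_a = exp(-3 * lam)

# (b) At least two misprints on the first two pages.
p_b = 1 - exp(-2 * lam) * (1 + 2 * lam)

# (c) At least two on the first two pages, given at least one on the first.
q = exp(-lam)                  # P(no misprint on a given page)
p_c = ((1 - q * (1 + lam)) + lam * q * (1 - q)) / (1 - q)

# Cross-check of (c): remove the event {X1 = 1, X2 = 0} from {X1 >= 1}.
p_c_alt = ((1 - q) - lam * q * q) / (1 - q)

print(p_a, p_b, p_c, p_c_alt)
```
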


Poisson processes have three important properties given in the theorems below. 
The first of these is an immediate consequence of Definition 6.1.2 and Theorem 
6.1.4, and says that the number of occurrences in an interval depends only on the 
length of the interval, and not on where the interval begins. This property is called 
stationarity. 
Theorem 6.1.5 (Poisson Processes are Stationary). For any s, t > 0,

\[ X(s + t) - X(s) = X(t). \tag{6.27} \]


The next theorem expresses the independence assumption of Definition 6.1.2 
with conditional probabilities. It says that the process is “memoryless,” that is, the 
probability of k occurrences in an interval is the same regardless of how many went 
before. 


Theorem 6.1.6 (Poisson Processes are Memoryless). For any s, t > 0 and i, k =
0, 1, 2, ...,

\[ P(X(s + t) = i + k \mid X(s) = i) = P(X(t) = k). \tag{6.28} \]

Proof. For any s, t > 0 and i, k = 0, 1, ...,

\[ \begin{aligned}
P(X(s + t) = i + k \mid X(s) = i) &= \frac{P(X(s + t) = i + k, X(s) = i)}{P(X(s) = i)} \\
&= \frac{P(X(s + t) - X(s) = k, X(s) = i)}{P(X(s) = i)} \\
&= \frac{P(X(t) = k, X(s) = i)}{P(X(s) = i)} \\
&= \frac{P(X(t) = k)P(X(s) = i)}{P(X(s) = i)} \\
&= P(X(t) = k).
\end{aligned} \tag{6.29} \]


The next theorem shows that in a Poisson process, the “waiting time” for an
occurrence and the “interarrival time” (the time between any two consecutive
occurrences) both have the same exponential distribution with parameter λ. (In this
context, it is customary to regard the parameter t to be time and the occurrences to
be arrivals.)


Theorem 6.1.7 (Waiting Time and Interarrival Time in Poisson Processes). 


1. Let s > 0 be any instant and let T > 0 denote the length of time we have to wait
for the first arrival after s, that is, let this arrival occur at the instant s + T.
Then T is an exponential random variable with parameter λ.


2. Assume that an arrival occurs at an instant s > 0 and let T > 0 denote the time
between this arrival and the next one, that is, let the next arrival occur at the
instant s + T. Then T is an exponential random variable with parameter λ.


Proof. 1. Clearly, for any $t > 0$, the waiting time $T$ is $\le t$ if and only if there is at
least one arrival in the time interval $(s, s + t]$. Thus,

\[ P(T \le t) = P(X(s+t) - X(s) > 0) = P(X(t) > 0) = 1 - e^{-\lambda t}, \tag{6.30} \]

which, together with $P(T \le t) = 0$ for $t \le 0$, shows that $T$ has the distribution
function of an exponential random variable with parameter $\lambda$.


2. Instead of assuming that an arrival occurs at the instant $s$, we assume that it
occurs in the time interval $[s - \Delta s, s]$ and let $\Delta s \to 0$. Then, similarly to the
first part, for any $t > 0$,

\[
\begin{aligned}
P(T \le t) &= \lim_{\Delta s \to 0} P(X(s+t) - X(s) > 0 \mid X(s) - X(s - \Delta s) = 1) \\
&= P(X(s+t) - X(s) > 0) = P(X(t) > 0) = 1 - e^{-\lambda t},
\end{aligned}
\tag{6.31}
\]

and $P(T \le t) = 0$ for $t \le 0$. Thus $T$, too, has the distribution function of an
exponential random variable with parameter $\lambda$.


Theorem 6.1.7 has, by Example 5.1.4, the following corollary: 


Corollary 6.1.1. If in a Poisson process the arrival rate, that is, the mean number of
arrivals per unit time, is $\lambda$, then the mean interarrival time is $1/\lambda$.


The converse of Theorem 6.1.7 is also true, that is, if we have a stream of random
arrivals such that the waiting time for the first one and the successive interarrival
times are independent exponential random variables with parameter $\lambda$, then the
numbers of arrivals $X(t)$ during time intervals of length $t$ form a Poisson process with
rate $\lambda$. We omit the proof.
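
The converse can be checked empirically. The following is a minimal sketch, not a proof: it builds arrival streams from independent exponential interarrival times and compares the simulated distribution of the number of arrivals in $(0, t]$ with the Poisson probabilities for mean $\lambda t$. The rate, interval length, number of runs and seed are arbitrary illustrative choices, and the function names are ours.

```python
import math
import random

def count_arrivals(rate, t, rng):
    """Count arrivals in (0, t] when interarrival times are i.i.d. exponential(rate)."""
    clock = rng.expovariate(rate)        # waiting time for the first arrival
    count = 0
    while clock <= t:
        count += 1
        clock += rng.expovariate(rate)   # next interarrival time
    return count

def poisson_pf(k, mean):
    """Poisson probability function with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def compare(rate=2.0, t=1.5, runs=100_000, seed=1):
    """Return (k, simulated frequency, Poisson probability) for k = 0, ..., 5."""
    rng = random.Random(seed)
    counts = [count_arrivals(rate, t, rng) for _ in range(runs)]
    return [(k, counts.count(k) / runs, poisson_pf(k, rate * t)) for k in range(6)]

for k, simulated, theoretical in compare():
    print(f"k={k}: simulated {simulated:.4f}, Poisson {theoretical:.4f}")
```

With this many runs the two columns should agree to about two decimal places; the agreement is only statistical, so individual digits vary with the seed.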


Exercises 


In all the exercises below, assume a Poisson model. 


Exercise 6.1.1. Customers enter a store at a mean rate of 1 per minute. Find the
probabilities that:


1. more than one will enter in the first minute,

2. more than two will enter in the first two minutes,

3. more than one will enter in each of the first two minutes,

4. two will enter in the first minute and two in the second minute if four have
entered in the first two minutes.


Exercise 6.1.2. A textile plant turns out cloth that has 1 defect per 20 square yards.
Assume that 2 square yards of this material are in a pair of pants and 3 square yards 
in a coat. 


1. About what percentage of the pants will be defective? 
2. About what percentage of the coats will be defective? 
3. Explain the cause of the difference between the two preceding results. 
Exercise 6.1.3. In each gram of a certain radioactive substance two atoms will decay
on average per minute. Find the probabilities that:


1. in one gram more than two atoms will decay in one minute,

2. in two grams more than four atoms will decay in one minute,

3. in one gram more than four atoms will decay in one minute,

4. in one gram the time between two consecutive decays is more than a minute,

5. in two grams the time between two consecutive decays is more than half a
minute.


Exercise 6.1.4. In a certain city there are 12 murders on average per year. Assume 
that they are equally likely at any time and independent of each other, and approxi- 
mate the length of each month as 1/12 of a year. Find the probabilities that: 


1. there will be no murders in January and February, 

2. there will be none in exactly two, not necessarily consecutive, months of the 
year, 

3. there will be none in at most two, not necessarily consecutive, months of the 
year, 

4. there will be none in February if there was none in January. 


Exercise 6.1.5. Show that in a Poisson process with rate $\lambda$, the probability of an even
number of arrivals in any interval of length $t$ is $(1 + e^{-2\lambda t})/2$ and of an odd number
of arrivals is $(1 - e^{-2\lambda t})/2$. Hint: First find P(even) − P(odd).


Exercise 6.1.6. Suppose that a Poisson stream $X(t)$ of arrivals with rate $\lambda$ is split
into two streams A and B, so that each arrival goes to stream A with probability $p$
and to stream B with probability $q = 1 - p$, independently of one another. Prove that
the new streams are also Poisson processes, with rates $p\lambda$ and $q\lambda$, respectively. Hint:
First find a formula for the joint probability $P(X_A(t) = m, X_B(t) = n)$ in terms of
the original Poisson process $X(t)$ and the binomial p.f.


Exercise 6.1.7. Show that in a Poisson process, any two distinct interarrival times 
are independent of each other. 


Exercise 6.1.8. Show that for a Poisson r.v. $X$ with parameter $\lambda$, $\max_k P(X = k)$
occurs exactly at $\lambda - 1$ and at $\lambda$ if $\lambda$ is an integer, and only at $[\lambda]$ otherwise. (Here $[\lambda]$
denotes the greatest integer $\le \lambda$.) Hint: First show that $P(X = k) = (\lambda/k)P(X =
k - 1)$ for any $k > 0$.


6.2 Normal Random Variables 
Definition 6.2.1 (Normal Distribution). A random variable $X$ is normal or normally
distributed with parameters $\mu$ and $\sigma^2$ (abbreviated $N(\mu, \sigma^2)$), if it is continuous
with p.d.f.

\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2} \quad \text{for } -\infty < x < \infty. \tag{6.32} \]

The distribution of such an $X$ is called the normal distribution with parameters $\mu$
and $\sigma^2$, and the above p.d.f. is called the normal density function with parameters $\mu$
and $\sigma^2$.


The graph of such a function, for arbitrarily fixed $\mu$ and $\sigma^2$, is shown in Figure
6.2. It is symmetric about $x = \mu$, and its inflection points are at $x = \mu \pm \sigma$.


Fig. 6.2. The p.d.f. of a typical normal distribution.


This distribution was discovered by Abraham de Moivre around 1730 as the limiting
distribution of the (suitably scaled) binomial distribution as $n \to \infty$. Nevertheless
it used to be referred to as the Gaussian distribution, because many people
learned about it from the much later works of Gauss. The name “normal” comes from 
the fact that it occurs in so many applications that, with some exaggeration, it may 
seem abnormal if we encounter any other distribution. The reason for its frequent 
occurrence is the so-called central limit theorem (Section 6.3), which says, roughly 
speaking, that under very general conditions, the sum and the average of n arbi- 
trary independent random variables are asymptotically normal for large n. Thus, any 
physical quantity that arises as the sum of a large number of independent random in- 
fluences will have an approximately normal distribution. For instance the height and 
the weight of a more or less homogeneous population are approximately normally 
distributed. Other examples of normal random variables are: the x-coordinates of 
shots aimed at a target, the repeated measurements of almost any kind of laboratory 
data, the blood pressure and the temperature of people, the scores on the SAT, and so 
on. 

We will list several properties of the normal distribution as theorems, beginning 
with one that shows that Definition 6.2.1 does indeed define a probability density. 


Theorem 6.2.1. 


\[ \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(t-\mu)^2/2\sigma^2}\, dt = 1. \tag{6.33} \]
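Proof. This p.d.f. is one of those functions whose indefinite integral cannot be
expressed in terms of common elementary functions, but the definite integral above can
be evaluated by a special trick: First, we substitute $x = (t - \mu)/\sigma$. Then the integral
becomes

\[ I = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-x^2/2}\, dx. \tag{6.34} \]

Now, we write $y$ as the variable of integration and multiply the two forms of $I$,
obtaining

\[ I^2 = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-x^2/2}\, dx \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\, dy = \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\, dx\, dy. \tag{6.35} \]

Changing to polar coordinates, we get

\[ I^2 = \frac{1}{2\pi} \int_{0}^{2\pi} \int_{0}^{\infty} e^{-r^2/2}\, r\, dr\, d\theta = \int_{0}^{\infty} e^{-r^2/2}\, r\, dr. \tag{6.36} \]

Substituting $u = r^2/2$, $du = r\,dr$ yields

\[ I^2 = \int_{0}^{\infty} e^{-u}\, du = -e^{-u}\Big|_{0}^{\infty} = 1, \tag{6.37} \]

and so, since $I$ is nonnegative (why?), $I = 1$.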
Theorem 6.2.2. If $X$ is $N(\mu, \sigma^2)$, then

\[ E(X) = \mu. \tag{6.38} \]


Proof. The p.d.f. in Definition 6.2.1 is symmetric about $x = \mu$, and so Theorem
5.1.1 yields Equation 6.38.


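Theorem 6.2.3. If $X$ is $N(\mu, \sigma^2)$, then

\[ \mathrm{Var}(X) = \sigma^2. \tag{6.39} \]

Proof. By definition,

\[ \mathrm{Var}(X) = E((X - \mu)^2) = \int_{-\infty}^{\infty} (x - \mu)^2\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2}\, dx. \tag{6.40} \]

Substituting $u = (x - \mu)/\sigma$ and integrating by parts, we get

\[ \mathrm{Var}(X) = \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} u^2 e^{-u^2/2}\, du = \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-u^2/2}\, du = \sigma^2, \tag{6.41} \]

where the boundary term $-u e^{-u^2/2}\big|_{-\infty}^{\infty}$ of the integration by parts vanishes and the
last integral equals $\sqrt{2\pi}$ by Theorem 6.2.1.
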
Theorem 6.2.4 (A Linear Function of a Normal Random Variable is Normal). If
$X$ is $N(\mu, \sigma^2)$, then, for any constants $a \ne 0$ and $b$, $Y = aX + b$ is normal with
$E(Y) = a\mu + b$ and $\mathrm{Var}(Y) = (a\sigma)^2$.


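Proof. Assume $a > 0$. (The proof of the opposite case is left as an exercise.) Then
the d.f. of $Y$ can be computed as

\[
\begin{aligned}
F_Y(y) &= P(Y \le y) = P(aX + b \le y) = P\left(X \le \frac{y - b}{a}\right) \\
&= \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{(y-b)/a} e^{-(x-\mu)^2/2\sigma^2}\, dx,
\end{aligned}
\tag{6.42}
\]

and so the chain rule and the fundamental theorem of calculus give its p.d.f. as

\[
\begin{aligned}
f_Y(y) = F_Y'(y) &= \frac{1}{\sqrt{2\pi}\,\sigma}\, \frac{d}{dy}\left(\frac{y - b}{a}\right) e^{-((y-b)/a - \mu)^2/2\sigma^2} \\
&= \frac{1}{\sqrt{2\pi}\, a\sigma}\, e^{-(y - (a\mu + b))^2/2(a\sigma)^2}.
\end{aligned}
\tag{6.43}
\]

A comparison with Definition 6.2.1 shows that this function is the p.d.f. of a
normal r.v. with $a\mu + b$ in place of $\mu$ and $(a\sigma)^2$ in place of $\sigma^2$.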


Corollary 6.2.1. If $X$ is $N(\mu, \sigma^2)$, then $Z = (X - \mu)/\sigma$ is $N(0, 1)$.

Proof. Apply Theorem 6.2.4 with $a = 1/\sigma$ and $b = -\mu/\sigma$.


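Definition 6.2.2. The distribution $N(0, 1)$ is called the standard normal distribution,
and its p.d.f. and d.f. are denoted by $\varphi$ and $\Phi$, respectively, that is,

\[ \varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \quad \text{for } -\infty < z < \infty, \tag{6.44} \]

and

\[ \Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2}\, dt \quad \text{for } -\infty < z < \infty. \tag{6.45} \]
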
Corollary 6.2.2. If $X$ is $N(\mu, \sigma^2)$, then $F_X(x) = \Phi((x - \mu)/\sigma)$.

Proof.

\[ F_X(x) = P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = P\left(Z \le \frac{x - \mu}{\sigma}\right) = \Phi\left(\frac{x - \mu}{\sigma}\right). \]


Fig. 6.3. The area of any left tail of $\varphi$ equals the area of the corresponding right tail, that is,
$\Phi(-z) = 1 - \Phi(z)$.


As mentioned before, the p.d.f. of a normal r.v. cannot be integrated in terms
of the common elementary functions, and therefore the probabilities of $X$ falling in
various intervals are obtained from tables or by computer. Now, it would be
overwhelming to construct tables for all $\mu$ and $\sigma$ values required in applications, but
Corollary 6.2.1 makes this unnecessary. It enables us to compute the probabilities for
any $N(\mu, \sigma^2)$ r.v. $X$ from the single table of the standard normal distribution
function, which is given (with minor variations) in most probability or statistics books,
including this one. The next examples illustrate the procedure.


Example 6.2.1 (Height Distribution of Men). Assume that the height $X$, in inches, of
a randomly selected man in a certain population is normally distributed² with $\mu = 69$
and $\sigma = 2.6$. Find


1. $P(X < 72)$,
2. $P(X > 72)$,
3. $P(X < 66)$,
4. $P(|X - \mu| < 3)$.


In each case, we transform the inequalities so that $X$ will be standardized and use
the $\Phi$-table to find the required probabilities. However, the table gives $\Phi(z)$ only for
$z \ge 0$, and for $z < 0$ we need to make use of the symmetry of the normal distribution.
This implies that, for any $z$, $P(Z < -z) = P(Z > z)$. (See Figure 6.3.) Thus,


1. $P(X < 72) = P((X - \mu)/\sigma < (72 - 69)/2.6) \approx P(Z < 1.15) = \Phi(1.15) \approx
0.875$.


² Any such assumption is always just an approximation that is usually valid within only
three or four standard deviations from the mean. But that is the range where almost all of
the probability of the normal distribution falls, and although theoretically the tails of the
normal distribution are infinite, $\varphi(z)$ is so small for $|z| > 4$ that as a practical matter we
can ignore the fact that it gives nonzero probabilities to impossible events such as people
having negative heights or heights over ten feet.
2. $P(X > 72) = P((X - \mu)/\sigma > (72 - 69)/2.6) \approx P(Z > 1.15) = 1 - P(Z \le
1.15) = 1 - \Phi(1.15) \approx 1 - 0.875 = 0.125$.

3. $P(X < 66) = P((X - \mu)/\sigma < (66 - 69)/2.6) \approx P(Z < -1.15) = P(Z >
1.15) \approx 0.125$.

4. $P(|X - \mu| < 3) = P(|(X - \mu)/\sigma| < 3/2.6) \approx P(|Z| < 1.15) = 1 - [P(Z <
-1.15) + P(Z > 1.15)] = 2\Phi(1.15) - 1 \approx 0.75$.
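
As the text notes, these probabilities can also be obtained by computer instead of from the table. The sketch below is one way to do this with Python's standard library; it expresses $\Phi$ through the error function `math.erf` and standardizes as in Corollary 6.2.2. The helper names are ours. Because $z = 3/2.6$ is not rounded to 1.15, the results differ slightly in the third decimal from the table values above.

```python
import math

def std_normal_cdf(z):
    """Standard normal d.f. Phi(z), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """F_X(x) = Phi((x - mu) / sigma), as in Corollary 6.2.2."""
    return std_normal_cdf((x - mu) / sigma)

mu, sigma = 69.0, 2.6
p_below_72 = normal_cdf(72, mu, sigma)              # part 1
p_above_72 = 1.0 - p_below_72                       # part 2
p_below_66 = normal_cdf(66, mu, sigma)              # part 3
p_within_3 = normal_cdf(mu + 3, mu, sigma) - normal_cdf(mu - 3, mu, sigma)  # part 4

print(f"{p_below_72:.3f} {p_above_72:.3f} {p_below_66:.3f} {p_within_3:.3f}")
```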


Example 6.2.2 (Percentiles of Normal Test Scores). Assume that the math scores on
the SAT at a certain school were normally distributed with $\mu = 560$ and $\sigma = 50$.
Find the quartiles and the 90th percentile of this distribution.

For the third quartile, we have to find the score $x$ for which $P(X < x) =
0.75$ or, equivalently, $P((X - \mu)/\sigma < (x - 560)/50) = 0.75$. The quantity $z =
(x - 560)/50$ is called the z-score or the value of $x$ in standard units, and, by
Corollary 6.2.1, we thus first need to find the z-score for which $\Phi(z) = 0.75$, or
$z = \Phi^{-1}(0.75)$. In the body of the $\Phi$-table, look for 0.75 and for the corresponding
z-value find 0.675. Solving $z = (x - 560)/50$ for $x$, we obtain $x = 50z + 560 =
50 \cdot 0.675 + 560 \approx 594$. Hence 75% of the SAT scores were under 594.

For the first quartile, we have to find the score $x$ for which $P(X < x) = 0.25$
or, equivalently, $\Phi(z) = 0.25$. However, no $p = \Phi(z)$ value less than 0.5 is listed
in the table. The corresponding $z$ would be negative, and instead of finding $z$ we use
the symmetry of $\varphi$ to find $|z|$ for the corresponding right tail that has area 0.25. Thus
$\Phi(z) = 0.25$ is equivalent to $1 - \Phi(|z|) = 0.25$ or $\Phi(|z|) = 0.75$, and the table gives
$|z| = 0.675$. Hence $z = -0.675$ and $x = 50z + 560 = 50 \cdot (-0.675) + 560 \approx 526$.

The 90th percentile can be computed from $P(X < x) = 0.90$ or, equivalently,
from $\Phi(z) = 0.90$. The table shows $z \approx 1.282$, and so $x = 50z + 560 = 50 \cdot 1.282 +
560 \approx 624$.
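
The same percentiles can be checked by computer. This is a small sketch using `statistics.NormalDist` from the Python standard library, whose `inv_cdf` method plays the role of $F_X^{-1}$; the variable names are ours.

```python
from statistics import NormalDist

# SAT math scores modeled as N(560, 50^2), as in Example 6.2.2.
scores = NormalDist(mu=560, sigma=50)

first_quartile = scores.inv_cdf(0.25)
third_quartile = scores.inv_cdf(0.75)
percentile_90 = scores.inv_cdf(0.90)

# Rounded to whole points, this prints 526 594 624.
print(round(first_quartile), round(third_quartile), round(percentile_90))
```

The two quartiles are symmetric about the mean 560, which is the computational counterpart of the symmetry argument used for the first quartile.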


Theorem 6.2.5 (The Moment Generating Function of the Normal Distribution).
If $X$ is $N(\mu, \sigma^2)$, then

\[ \psi(t) = e^{\mu t + \sigma^2 t^2/2} \quad \text{for } -\infty < t < \infty. \tag{6.46} \]


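Proof. First compute the moment generating function of a standard normal r.v. $Z$.
By definition,

\[
\begin{aligned}
\psi_Z(t) = E(e^{tZ}) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tz - z^2/2}\, dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{(t^2 - (z-t)^2)/2}\, dz \\
&= e^{t^2/2}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(z-t)^2/2}\, dz = e^{t^2/2} \quad \text{for } -\infty < t < \infty.
\end{aligned}
\tag{6.47}
\]

Now $X = \sigma Z + \mu$ is $N(\mu, \sigma^2)$, and

\[ \psi_X(t) = E(e^{t(\sigma Z + \mu)}) = e^{\mu t} E(e^{\sigma t Z}) = e^{\mu t}\, \psi_Z(\sigma t) = e^{\mu t + \sigma^2 t^2/2} \quad \text{for } -\infty < t < \infty. \tag{6.48} \]
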
Theorem 6.2.6 (Any Nonzero Linear Combination of Independent Normal Random
Variables is Normal). Let $X_i$ be independent and $N(\mu_i, \sigma_i^2)$ random variables
for $i = 1, \ldots, n$, and let $X = \sum_{i=1}^{n} a_i X_i$ with the $a_i$ arbitrary constants, not all zero.
Then $X$ is $N(\mu, \sigma^2)$, with $\mu = \sum a_i \mu_i$ and $\sigma^2 = \sum (a_i \sigma_i)^2$.


Proof. Let $\psi_i$ denote the moment generating function of $X_i$. Then, by Theorem 5.3.2
and Equation 6.46,

\[
\psi_X(t) = \prod_{i=1}^{n} \psi_i(a_i t) = \prod_{i=1}^{n} e^{\mu_i a_i t + \sigma_i^2 a_i^2 t^2/2} = e^{\sum (\mu_i a_i t + \sigma_i^2 a_i^2 t^2/2)}
= e^{(\sum a_i \mu_i) t + (\sum (a_i \sigma_i)^2) t^2/2}. \tag{6.49}
\]

Comparing this expression with Equation 6.46 and using the uniqueness of the m.g.f.,
we obtain the result of the theorem.


Definition 6.2.3 (Random Sample and Sample Mean). $n$ independent and identically
distributed (abbreviated: i.i.d.) random variables $X_1, \ldots, X_n$ are said to form
a random sample of size $n$ from their common distribution, and $\bar{X}_n = (1/n) \sum X_i$
is called the sample mean.


Corollary 6.2.3. Let $X_i$ be i.i.d. $N(\mu, \sigma^2)$ random variables for $i = 1, \ldots, n$. Then
the sample mean is $N(\mu, \sigma^2/n)$.


Proof. Set $a_i = 1/n$, $\mu_i = \mu$ and $\sigma_i = \sigma$ in Theorem 6.2.6 for all $i$.


Example 6.2.3 (Heights of Men and Women). Assume that the height $X$, in inches,
of a randomly selected woman in a certain population is normally distributed with
$\mu_X = 66$ and $\sigma_X = 2.6$ and the height $Y$, in inches, of a randomly selected man
is normally distributed with $\mu_Y = 69$ and $\sigma_Y = 2.6$. Find the probability that a
randomly selected woman is taller than an independently randomly selected man.

The probability we want to find is $P(Y - X < 0)$. By Theorem 6.2.6, $Y - X$ is
$N(3, 2 \cdot 2.6^2) = N(3, 13.52)$. Thus

\[
\begin{aligned}
P(Y - X < 0) &= P\left(\frac{Y - X - 3}{\sqrt{13.52}} < \frac{0 - 3}{\sqrt{13.52}}\right) \approx \Phi(-0.816) = 1 - \Phi(0.816) \\
&\approx 1 - 0.793 = 0.207.
\end{aligned}
\tag{6.50}
\]
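
The computation in Example 6.2.3 can be reproduced by computer. In this sketch we rely on `statistics.NormalDist` from the Python standard library, which supports subtracting two `NormalDist` objects under the assumption that the underlying variables are independent; that is exactly the situation of Theorem 6.2.6 with coefficients 1 and −1. The variable names are ours.

```python
from statistics import NormalDist

woman = NormalDist(mu=66, sigma=2.6)   # X
man = NormalDist(mu=69, sigma=2.6)     # Y

# Distribution of Y - X for independent X and Y: means subtract, variances add.
difference = man - woman

p_woman_taller = difference.cdf(0)     # P(Y - X < 0)
print(difference.mean, difference.variance, round(p_woman_taller, 3))
```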


Exercises 


Exercise 6.2.1. For a standard normal r.v. Z, find 


1. $P(Z < 2)$,
2. $P(Z > 2)$,
3. PZ SD), 
4. $P(Z < -2)$,
5. $P(-2 < Z < 2)$,
6. $P(|Z| > 2)$,
7. $P(-2 < Z < 1)$,
8. $z$ such that $P(z < Z) = 0.05$,
9. $z$ such that $P(-z < Z < z) = 0.9$,
10. $z$ such that $P(-z < Z < z) = 0.8$.


Exercise 6.2.2. Let $X$ be a normal r.v. with $\mu = 10$ and $\sigma = 2$. Find


1. $P(X < 11)$,
2. $P(X > 11)$,
3. $P(X < 9)$,
4. $P(9 < X < 11)$,
5. $P(9 < X < 12)$,
6. $x$ such that $P(x < X) = 0.05$,
7. $x$ such that $P(10 - x < X < 10 + x) = 0.9$,
8. $x$ such that $P(10 - x < X < 10 + x) = 0.8$.


Exercise 6.2.3. 1. Prove that the standard normal density $\varphi$ has inflection points at
$z = \pm 1$.
2. Prove that the general normal density given in Definition 6.2.1 has inflection
points at $x = \mu \pm \sigma$.


Exercise 6.2.4. Assume that the height $X$, in inches, of a randomly selected woman
in a certain adult population is normally distributed with $\mu_X = 66$ and $\sigma_X = 2.6$
and the height $Y$, in inches, of a randomly selected man is normally distributed with
$\mu_Y = 69$ and $\sigma_Y = 2.6$ and half the adult population is male and half is female.


1. Find the probability density of the height $H$ of a randomly selected adult from
this population and sketch its graph.

2. Find $E(H)$ and $SD(H)$.

3. Find $P(66 < H < 69)$.


Exercise 6.2.5. Prove Theorem 6.2.4 for $a < 0$


1. by modifying the proof given for a > 0, 
2. by using the moment generating function. 


Exercise 6.2.6. Assume that the math scores on the SAT at a certain school were
normally distributed with unknown $\mu$ and $\sigma$ and two students got their reports back
with the following results: 750 (95th percentile) and 500 (46th percentile). Find $\mu$
and $\sigma$. (Hint: Obtain and solve two simultaneous equations for the two unknowns $\mu$
and $\sigma$.)


Exercise 6.2.7. The p.d.f. of a certain distribution is determined to be of the form
$ce^{-(x+2)^2/24}$. Find $\mu$, $\sigma$ and $c$.


Exercise 6.2.8. The p.d.f. of a certain distribution is determined to be of the form
$ce^{-x^2 - 4x}$. Find $\mu$, $\sigma$ and $c$.
Exercise 6.2.9. Assume that the weight $X$, in ounces, of a randomly selected can of
coffee of a certain brand is normally distributed with $\mu = 16$ and $\sigma = 0.32$. Find
the probability that the weights of two independently selected cans from this brand
differ by more than 1/2 oz.


Exercise 6.2.10. Let $\bar{Z}_n$ denote the sample mean for a random sample of size $n$ from
the standard normal distribution. For $n = 1$, 4 and 16


1. sketch the p.d.f. of each $\bar{Z}_n$ in the same coordinate system,
2. compute the quartiles of each $\bar{Z}_n$.


Exercise 6.2.11. Prove that $\Phi^{-1}(1 - p) = -\Phi^{-1}(p)$ for $0 < p < 1$.


Exercise 6.2.12. Prove that if $X$ is $N(\mu, \sigma^2)$, then $F_X^{-1}(p) = \mu + \sigma\Phi^{-1}(p)$ for
$0 < p < 1$.


6.3 The Central Limit Theorem 


Earlier, we saw that the binomial distribution becomes Poisson if $n \to \infty$ while
$p \to 0$ such that $np = \lambda$ remains constant. About a hundred years before Poisson,
de Moivre noticed a different approximation to the binomial distribution. He
observed and proved that if $n$ is large with $p$ fixed, then the binomial probabilities
lie approximately on a normal curve. An illustration of this fact can be seen in Figure
6.4, and the fact is stated more precisely in the subsequent theorem.³


Fig. 6.4. Histogram of the binomial p.f. for $n = 60$ and $p = 1/2$, with the approximating
normal p.d.f. superimposed.


³ de Moivre discovered the normal curve and proved this theorem only for $p = 1/2$. For
$p \ne 1/2$ he only sketched the result. It was Pierre-Simon de Laplace who around 1812 gave
the details for arbitrary $p$, and outlined a further generalization, the central limit theorem.
Theorem 6.3.1 (De Moivre-Laplace Limit Theorem). The binomial probabilities
$p_k = \binom{n}{k} p^k q^{n-k}$ can be approximated by the corresponding values of the $N(\mu, \sigma^2)$
distribution with matching parameters, that is, with $\mu = np$ and $\sigma^2 = npq$. More
precisely,

\[ p_k \sim \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(k-\mu)^2/2\sigma^2}, \tag{6.51} \]

where the symbol $\sim$ means that the ratio of the two sides tends to 1 as $n \to \infty$.


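Proof. We just give an outline, with various technical details omitted.

The proof rests on the linearization $\ln(1 + x) \sim x$ as $x \to 0$, known from
calculus.

We want to express $p_k$ for $k$-values near the mean $np$. Thus, writing $m = [np]$
we have, for $k > m$ (for $k < m$ the argument would be similar; we omit it),

\[
\begin{aligned}
\frac{p_k}{p_m} &= \frac{n!}{k!\,(n-k)!} \cdot \frac{m!\,(n-m)!}{n!} \cdot \frac{p^k q^{n-k}}{p^m q^{n-m}} \\
&= \frac{m!}{k!} \cdot \frac{(n-m)!}{(n-k)!} \cdot \left(\frac{p}{q}\right)^{k-m} = \prod_{i=m+1}^{k} \left(\frac{n-i+1}{i} \cdot \frac{p}{q}\right).
\end{aligned}
\tag{6.52}
\]

Next, we take logarithms and replace $i$ with $j = i - m$. Then⁴

\[
\begin{aligned}
\ln\left(\frac{p_k}{p_m}\right) &\approx \sum_{j=1}^{k-m} \ln\left(\frac{n-m-j}{m+j} \cdot \frac{p}{q}\right)
\approx \sum_{j=1}^{k-m} \ln\left(\frac{nq-j}{np+j} \cdot \frac{p}{q}\right) = \sum_{j=1}^{k-m} \ln\left(\frac{1 - j/(nq)}{1 + j/(np)}\right) \\
&\approx -\sum_{j=1}^{k-m} \left(\frac{j}{nq} + \frac{j}{np}\right) = -\frac{1}{npq} \sum_{j=1}^{k-m} j
\approx -\frac{1}{npq} \cdot \frac{(k-m)^2}{2} \approx -\frac{(k-np)^2}{2npq}.
\end{aligned}
\tag{6.53}
\]

Hence

\[ p_k \sim p_m\, e^{-(k-np)^2/2npq}. \tag{6.54} \]


⁴ If $n \to \infty$ with $p$ and $k$ fixed, then $n - i + 1 \sim n - i$ for all $i \le k$ and $np \sim [np]$, as well.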
The sum of the probabilities $p_k$ over all $k$-values equals 1. Now, on the right-hand
side we can approximate the sum by an integral (again, if $n \to \infty$), and then from
the definition of the normal distribution we see that we must have

\[ p_m \sim \frac{1}{\sqrt{2\pi}\,\sigma} \tag{6.55} \]

with $\sigma = \sqrt{npq}$, and

\[ p_k \sim \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(k-np)^2/2npq}. \tag{6.56} \]



If $S_n$ is a binomial r.v. with parameters $n$ and $p$, then, for integers $a$ and $b$ with
$a \le b$,

\[ P(a \le S_n \le b) = \sum_{k=a}^{b} p_k. \tag{6.57} \]

Now the sum on the right-hand side equals the sum of the areas of the rectangles of
the histogram of the $p_k$ values. It can be approximated by the area under the
corresponding normal curve, that is, by the integral of the p.d.f. on the right of Equation
6.56, from the beginning $a - 1/2$ of the first rectangle to the end $b + 1/2$ of the last
rectangle:

\[ P(a \le S_n \le b) \approx \frac{1}{\sqrt{2\pi}\,\sigma} \int_{a-1/2}^{b+1/2} e^{-(x-\mu)^2/2\sigma^2}\, dx, \tag{6.58} \]

where $\mu = np$ and $\sigma = \sqrt{npq}$. Changing variables from $x$ to $z = (x - \mu)/\sigma$, we
can write the expression on the right-hand side in terms of the standard normal d.f.
and we obtain


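Corollary 6.3.1 (Normal Approximation with Continuity Correction). For large
values of $n$

\[ P(a \le S_n \le b) \approx \Phi\left(\frac{b + \frac{1}{2} - \mu}{\sigma}\right) - \Phi\left(\frac{a - \frac{1}{2} - \mu}{\sigma}\right). \tag{6.59} \]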


Remarks. 


1. The term 1/2 in the arguments on the right-hand side is called the correction for
continuity. It may be ignored for large values of $\sigma$, say, for $\sigma > 10$, unless $b - a$
is small, say, $b - a < 10$.

2. The closer $p$ is to 1/2, the better the approximation. For $p$ between 0.4 and 0.6,
it can be used for $n \ge 25$, but for $p \approx 0.1$ or 0.9 it is good for $n \ge 50$ only.
Example 6.3.1 (Coin Tossing). We toss a fair coin $n = 100$ times. Letting $S_n$ denote
the number of heads obtained, find the normal approximation to $P(45 \le S_n \le 55)$.
Here $p = 1/2$, $\mu = np = 50$ and $\sigma = \sqrt{npq} = 5$. By Equation 6.59,

\[
\begin{aligned}
P(45 \le S_n \le 55) &\approx \Phi\left(\frac{55 + \frac{1}{2} - 50}{5}\right) - \Phi\left(\frac{45 - \frac{1}{2} - 50}{5}\right) \\
&= \Phi(1.1) - \Phi(-1.1) = 2\Phi(1.1) - 1 \approx 0.72867.
\end{aligned}
\tag{6.60}
\]

This result is an excellent approximation to the exact value

\[ \sum_{k=45}^{55} \binom{100}{k} \left(\frac{1}{2}\right)^{100} \approx 0.72875. \tag{6.61} \]

It is also interesting to compare the approximation of Equation 6.56 with the
binomial value. For instance, for $k = 50$ we have

\[ p_{50} = \binom{100}{50} \left(\frac{1}{2}\right)^{100} \approx 0.079589 \tag{6.62} \]

and from Formula 6.56 we get

\[ p_{50} \sim \frac{1}{\sqrt{2\pi} \cdot 5}\, e^{0} \approx 0.079788. \tag{6.63} \]

We can also use Formula 6.59 with $a = b = 50$ to approximate $p_{50}$. This method
yields

\[
\begin{aligned}
p_{50} &\approx \Phi\left(\frac{50 + \frac{1}{2} - 50}{5}\right) - \Phi\left(\frac{50 - \frac{1}{2} - 50}{5}\right) \\
&= \Phi(0.1) - \Phi(-0.1) = 2\Phi(0.1) - 1 \approx 0.079656,
\end{aligned}
\tag{6.64}
\]

a slightly better approximation than the preceding one.
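
The comparisons in Example 6.3.1 are easy to reproduce by computer. The sketch below computes the exact binomial sum with `math.comb` and the continuity-corrected normal approximation of Equation 6.59, with $\Phi$ taken from `statistics.NormalDist`; the function names are ours.

```python
import math
from statistics import NormalDist

def binomial_pf(k, n, p):
    """Binomial probability p_k = C(n, k) p^k q^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def exact_prob(a, b, n, p):
    """P(a <= S_n <= b) as the exact sum of Equation 6.57."""
    return sum(binomial_pf(k, n, p) for k in range(a, b + 1))

def normal_approx(a, b, n, p):
    """P(a <= S_n <= b) by Equation 6.59, with the continuity correction."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    phi = NormalDist().cdf
    return phi((b + 0.5 - mu) / sigma) - phi((a - 0.5 - mu) / sigma)

for a, b in [(45, 55), (50, 50)]:
    print(a, b, round(exact_prob(a, b, 100, 0.5), 6), round(normal_approx(a, b, 100, 0.5), 6))
```

Trying other values of $n$ and $p$ with these two functions is a quick way to see the effect described in the Remarks: the agreement worsens as $p$ moves away from 1/2 for fixed $n$.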


Example 6.3.2 (Difference of Two Polls). Suppose that two polling organizations
each take a random sample of 200 voters in a state and ask about their preference
for a certain candidate, with a yes or no answer. Find an approximate upper bound
for the probability that the proportions of yes answers in the two polls differ by more
than 4%.

Denoting the proportions of yes answers in the two polls by $X$ and $Y$, we
are interested in the probability $P(|X - Y| > 0.04)$, which can be written as
$P(|200X - 200Y| > 8)$, where $200X$ and $200Y$ are i.i.d. binomial random variables
with parameters $n = 200$ and an unknown $p$. The mean of the difference is 0
and the variance is $400p(1 - p)$. Thus, we can standardize the desired probability
and apply Theorem 6.3.1:⁵


⁵ By Theorem 6.3.1, $200X$ and $200Y$ are both approximately normal, and they are also
independent, hence their difference is also approximately normal.
\[
P(|X - Y| > 0.04) = P\left(\frac{|200X - 200Y|}{20\sqrt{p(1-p)}} > \frac{8}{20\sqrt{p(1-p)}}\right)
\approx P\left(|Z| > \frac{2}{5\sqrt{p(1-p)}}\right). \tag{6.65}
\]

Now, it is easy to show that $p(1 - p)$ is maximum when $p = 1/2$. So $2/(5\sqrt{p(1-p)})$
is minimum then, with $2/(5\sqrt{p(1-p)}) = 4/5$. Hence

\[ P(|X - Y| > 0.04) \le P\left(|Z| > \frac{4}{5}\right) = 2[1 - \Phi(0.8)] \approx 0.424. \tag{6.66} \]


Thus, there is a rather substantial chance that the two polls will differ by more
than 4 percentage points.
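
The dependence of this bound on $p$ can be examined numerically. The sketch below evaluates the normal approximation $P(|Z| > c) = 2[1 - \Phi(c)]$, with $c = 0.04/\sqrt{2p(1-p)/200}$, over a grid of $p$ values and confirms that the largest value occurs at $p = 1/2$. The function name, its default arguments and the grid are our own illustrative choices.

```python
import math
from statistics import NormalDist

def differ_prob(p, n=200, gap=0.04):
    """Normal approximation to P(|X - Y| > gap) for two independent polls of size n."""
    sd = math.sqrt(2 * p * (1 - p) / n)        # standard deviation of X - Y
    return 2 * (1 - NormalDist().cdf(gap / sd))

# Evaluate on the grid p = 0.01, 0.02, ..., 0.99 and take the largest value.
bound = max(differ_prob(k / 100) for k in range(1, 100))
print(round(bound, 3), round(differ_prob(0.5), 3), round(differ_prob(0.1), 3))
```

For a candidate with support far from 50%, the same function shows how much smaller the probability of a 4-point discrepancy becomes.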


As mentioned in Footnote 3 on page 193, Laplace discovered a very important 
generalization of Theorem 6.3.1, the first version of which, however, was proved 
only in 1901 by A. Liapounov. This generalization is based on the decomposition 
of a binomial random variable into a sum of i.i.d. Bernoulli random variables as
$S_n = \sum_{i=1}^{n} X_i$, where $X_i = 1$ if the $i$th trial results in a success and 0 otherwise.
Now, when we standardize $S_n$, we divide by $\sqrt{npq}$ and so, as $n \to \infty$, we have
an increasing number of smaller and smaller terms. What Laplace noticed was that
the limiting distribution is still normal in many cases even if the X; are other than 
Bernoulli random variables. This fact has been proved under various conditions on 
the X; (they do not even have to be i.i.d.) and is known as the central limit theorem 
(CLT). We present a version from 1922, due to J. W. Lindeberg. 


Theorem 6.3.2 (The Central Limit Theorem). For any positive integer $n$, let
$X_1, X_2, \ldots, X_n$ be i.i.d. random variables with mean $\mu$ and standard deviation $\sigma$,
and let $S_n^*$ denote the standardization of their sum, that is, let

\[ S_n^* = \frac{1}{\sqrt{n}\,\sigma} \left(\sum_{i=1}^{n} X_i - n\mu\right). \tag{6.67} \]

Then, for any real $x$,

\[ \lim_{n \to \infty} P(S_n^* \le x) = \Phi(x). \tag{6.68} \]


Proof. Again, we just give an outline of the proof and omit some difficult technical 
details. 

We are going to use moment generating functions to deal with the distribution 
of the sum in the theorem because, as we know, the m.g.f. of a sum of independent 
r.v.'s is simply the product of the m.g.f.'s of the terms. Now the assumption of the
existence of $\mu$ and $\sigma$ does not guarantee the existence of the m.g.f. of $X_i$ and, in
general, this problem is handled by truncating the $X_i$, but we skip this step and
assume the existence of the m.g.f. of $X_i$ or, equivalently, the existence of the m.g.f.
$\psi$ of the standardization $X_i^* = (X_i - \mu)/\sigma$.
Then

\[ S_n^* = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{X_i - \mu}{\sigma} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i^*, \tag{6.69} \]

and so the m.g.f. $\psi_n$ of $S_n^*$ is given by

\[ \psi_n(t) = \left[\psi\left(\frac{t}{\sqrt{n}}\right)\right]^n. \tag{6.70} \]

Now, $\psi(0) = 1$, $\psi'(0) = E(X_i^*) = 0$ and $\psi''(0) = E((X_i^*)^2) = 1$. Hence
Taylor's formula gives

\[ \psi(t) = 1 + \frac{t^2}{2} + \frac{\psi'''(c)}{3!}\, t^3, \tag{6.71} \]

where $c$ is some number between 0 and $t$. From here,

\[ \psi_n(t) = \left[1 + \frac{t^2}{2n} + \frac{\psi'''(c)}{3!} \left(\frac{t}{\sqrt{n}}\right)^3\right]^n, \tag{6.72} \]

where $c$ now lies between 0 and $t/\sqrt{n}$, and with some calculus it can be shown that

\[ \lim_{n \to \infty} \psi_n(t) = e^{t^2/2}. \tag{6.73} \]

The expression on the right-hand side is the m.g.f. of the standard normal
distribution, and so the limiting distribution of $S_n^*$, as $n \to \infty$, is the standard normal
distribution.
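
The limit in Equation 6.73 can be watched numerically in a simple special case. As an illustrative assumption of ours (it is not part of the proof), take each $X_i^*$ to be $+1$ or $-1$ with probability 1/2, which has mean 0 and variance 1; its m.g.f. is $\psi(t) = (e^t + e^{-t})/2 = \cosh t$. The sketch evaluates $\psi_n(t) = [\psi(t/\sqrt{n})]^n$ from Equation 6.70 for increasing $n$ and compares it with $e^{t^2/2}$.

```python
import math

def psi(t):
    """m.g.f. of a standardized coin flip: X* = +1 or -1, each with probability 1/2."""
    return math.cosh(t)

def psi_n(t, n):
    """m.g.f. of the standardized sum S_n*, as in Equation 6.70."""
    return psi(t / math.sqrt(n)) ** n

t = 1.0
limit = math.exp(t * t / 2)   # m.g.f. of the standard normal distribution at t
for n in (1, 10, 100, 10_000):
    print(n, psi_n(t, n), limit)
```

The printed values increase toward the limit as $n$ grows, which is the behavior the "some calculus" step of the proof establishes in general.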


In statistical applications, it is often the mean of a random sample (see Definition 
6.2.3) rather than the sum that we need and, fortunately, the distribution of the sample 
mean also approaches a normal distribution: 


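Corollary 6.3.2. Let $X_1, X_2, \ldots, X_n$ be as in the theorem above, let $\bar{X}_n = (1/n)
\sum_{i=1}^{n} X_i$ denote their average and let

\[ \bar{X}_n^* = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \tag{6.74} \]

be the standardization of $\bar{X}_n$. Then, for any real $x$,

\[ \lim_{n \to \infty} P(\bar{X}_n^* \le x) = \Phi(x). \tag{6.75} \]


Proof.

\[ \bar{X}_n^* = \frac{\sqrt{n}}{\sigma} \left(\frac{1}{n} \sum_{i=1}^{n} X_i - \mu\right) = \frac{1}{\sqrt{n}\,\sigma} \left(\sum_{i=1}^{n} X_i - n\mu\right) = S_n^*. \tag{6.76} \]
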
Example 6.3.3 (Total Weight of People in a Sample). Assume that the weight of the
adults in a population has mean $\mu = 150$ pounds and s.d. $\sigma = 30$ pounds. Find
(approximately) the probability that the total weight of a random sample of 36 such
people exceeds 5700 pounds.

The weight of any individual is not a normal r.v. but a mixture (that is, the p.d.f.
is a weighted average) of two, approximately normal, random variables: one for the
women and one for the men. This fact is, however, immaterial because by the CLT,
the total weight $W$ is approximately normal with mean $n\mu = 36 \cdot 150 = 5400$ and
$SD = \sqrt{n}\,\sigma = \sqrt{36} \cdot 30 = 180$. Thus,

\[
\begin{aligned}
P(W > 5700) &= P\left(\frac{W - 5400}{180} > \frac{5700 - 5400}{180}\right) \qquad (6.77) \\
&\approx P(Z > 1.667) = 1 - \Phi(1.667) \approx 0.048. \qquad (6.78)
\end{aligned}
\]


The law of large numbers is a straightforward consequence of the CLT whenever 
the latter holds: 


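Corollary 6.3.3 (Law of Large Numbers). For any positive integer $n$, let $X_1, X_2,
\ldots, X_n$ be i.i.d. random variables with mean $\mu$ and standard deviation $\sigma$. Then, for
any $\varepsilon > 0$, their mean $\bar{X}_n$ satisfies the relation

\[ \lim_{n \to \infty} P(|\bar{X}_n - \mu| < \varepsilon) = 1. \tag{6.79} \]


Proof. By Equation 6.74,

\[ \bar{X}_n - \mu = \frac{\sigma}{\sqrt{n}}\, \bar{X}_n^*, \tag{6.80} \]

and so

\[
\begin{aligned}
\lim_{n \to \infty} P(|\bar{X}_n - \mu| < \varepsilon) &= \lim_{n \to \infty} P\left(|\bar{X}_n^*| < \frac{\sqrt{n}\,\varepsilon}{\sigma}\right)
= \lim_{n \to \infty} P\left(-\frac{\sqrt{n}\,\varepsilon}{\sigma} < \bar{X}_n^* < \frac{\sqrt{n}\,\varepsilon}{\sigma}\right) \\
&= \Phi(\infty) - \Phi(-\infty) = 1.
\end{aligned}
\tag{6.81}
\]
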

Example 6.3.4 (Determining Sample Size). Suppose that in a public opinion poll the
proportion $p$ of voters who favor a certain proposition is to be determined. In other
words, we want to estimate the unknown probability $p$ of a randomly selected voter
being in favor of the proposition. We take a random sample, with the responses being
i.i.d. Bernoulli random variables $X_i$ with parameter $p$, and use $\bar{X}_n$ to estimate $p$.
Approximately how large a random sample must be taken to ensure that

\[ P(|\bar{X}_n - p| < 0.1) \ge 0.95? \tag{6.82} \]
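By the CLT

\[
\begin{aligned}
P(|\bar{X}_n - p| < 0.1) &= P\left(\frac{|\bar{X}_n - p|}{\sqrt{p(1-p)/n}} < \frac{0.1}{\sqrt{p(1-p)/n}}\right) \\
&\approx P\left(|Z| < \frac{0.1}{\sqrt{p(1-p)/n}}\right) = 2\Phi\left(\frac{0.1}{\sqrt{p(1-p)/n}}\right) - 1.
\end{aligned}
\tag{6.83}
\]

Now, this quantity is $\ge 0.95$ if

\[ \Phi\left(\frac{0.1}{\sqrt{p(1-p)/n}}\right) \ge 0.975 \tag{6.84} \]

or, equivalently, if

\[ \frac{0.1}{\sqrt{p(1-p)/n}} \ge \Phi^{-1}(0.975) \approx 1.96 \tag{6.85} \]

or

\[ \sqrt{n} \ge \frac{1.96}{0.1} \sqrt{p(1-p)} \tag{6.86} \]

or

\[ n \ge 384.16\, p(1-p). \tag{6.87} \]

Here $p(1 - p)$ has its maximum at $p = 1/2$, and then $p(1 - p) = 1/4$. Thus

\[ n \ge 97 \tag{6.88} \]

ensures that $P(|\bar{X}_n - p| < 0.1) \ge 0.95$ for any value of $p$.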

The lower bound for $n$ obtained by the normal approximation above is actually
the same as the precise value given by the binomial distribution. Indeed, for $n = 97$
and $p = 1/2$ a computer evaluation gives $P(|\bar{X}_n - p| < 0.1) = 0.958\ldots$ and for
$n = 96$ and $p = 1/2$ it gives $P(|\bar{X}_n - p| < 0.1) = 0.948\ldots$.

If we know an approximate value in advance for p that is far from 1/2, then 
Formula 6.87 can be used to obtain a lower value for the required sample size. 
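
Both remarks can be illustrated by computer. The sketch below evaluates the exact binomial probability for $p = 1/2$ (the inequality $|k/n - 1/2| < 1/10$ is rewritten as $|10k - 5n| < n$ so that only integers are compared) and the sample size suggested by Formula 6.87 for a given $p$. The function names and the example value $p = 0.1$ are our own illustrative choices.

```python
import math

def exact_coverage(n):
    """P(|mean - 1/2| < 0.1) for n Bernoulli(1/2) trials, computed exactly."""
    # |k/n - 1/2| < 1/10 is equivalent to |10k - 5n| < n in integer arithmetic.
    favorable = sum(math.comb(n, k) for k in range(n + 1) if abs(10 * k - 5 * n) < n)
    return favorable / 2**n

def clt_sample_size(p, margin=0.1, z=1.96):
    """Smallest integer n with n >= (z / margin)^2 p (1 - p), as in Formula 6.87."""
    return math.ceil((z / margin) ** 2 * p * (1 - p))

print(exact_coverage(97), exact_coverage(96))
print(clt_sample_size(0.5), clt_sample_size(0.1))
```

The first line shows the exact coverage on either side of the threshold 0.95; the second shows that a prior guess of $p \approx 0.1$ would reduce the sample size given by Formula 6.87 from 97 to 35.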


Exercises 
Exercise 6.3.1. A die is rolled 20 times. Find the probability of obtaining 3 sixes, 


both by the binomial p.f. and by the normal approximation with continuity correction 
(Equation 6.59). 




Exercise 6.3.2. A die is rolled 20 times. Find the probability of obtaining 3, 4, or 
5 sixes, both by the binomial p.f. and by the normal approximation with continuity 
correction (Equation 6.59). 


Exercise 6.3.3. Choose 100 independent random numbers, uniformly distributed on 
the interval [0, 1]. What is the approximate probability of their average falling in the 
interval [0.49, 0.51]? 


Exercise 6.3.4. The height of 100 persons is measured to the nearest inch. What is 
the approximate probability that the average of these rounded numbers differs from 
the true average by less than 1%? 


Exercise 6.3.5. A scale is calibrated by repeatedly measuring a standard weight of
10 grams and taking the average $\bar{X}$ of these measurements. Due to unpredictable
causes, such as changes in temperature, air pressure and friction, the individual
measurements vary slightly. They are taken to be independent random variables with
$\sigma = 6\,\mu\text{g}$ each.⁶


1. How many weighings are needed to make $\sigma_{\bar{X}} < 0.5\,\mu\text{g}$?
2. How many weighings are needed to make $P(|\bar{X} - 10\,\text{g}| < 0.5\,\mu\text{g}) > 0.9$?


Exercise 6.3.6. In Example 6.1.2 we obtained the exact answer to the question of
finding the probability that on a certain night 40 or fewer diners will show up at a
restaurant if the number of diners is Poisson with $\lambda = 50$. Answer the same question
approximately, by using the CLT and the fact that a Poisson r.v. with $\lambda = 50$ is the
sum of 50 independent Poisson r.v.'s with $\lambda = 1$.


6.4 Negative Binomial, Gamma and Beta Random Variables 


In this section, we shall discuss three other named families of random variables that 
occur in various applications. 

The negative binomial distribution is a generalization of the geometric distribution:
In a sequence of i.i.d. Bernoulli trials we wait for the $r$th success, rather than
just the first one. The probability that the $r$th success occurs on the $k$th trial equals
the probability that in the first $k - 1$ trials we have exactly $r - 1$ successes and the
$k$th trial is a success, that is, $\binom{k-1}{r-1} p^{r-1} q^{k-r}$ times $p$. Thus, we make the
following definition:


Definition 6.4.1 (Negative Binomial Random Variables). Suppose we perform
i.i.d. Bernoulli trials with parameter $p$, until we obtain $r$ successes, for a fixed positive
integer $r$. The number $X_r$ of such trials up to and including the $r$th success
is called a negative binomial random variable⁷ with parameters $p$ and $r$. It has the
probability function

\[ f(k) = P(X_r = k) = \binom{k-1}{r-1} p^r q^{k-r} \quad \text{for } k = r, r+1, r+2, \ldots. \tag{6.89} \]

The distribution of $X_r$ is called negative binomial, too.


6 1 μg = 1 microgram = $10^{-6}$ g.
7 Some authors define a negative binomial r.v. as the number of failures before the rth success,
rather than the total number of trials.


The reason for the name “negative binomial” is that $f(k)$ can be written as

\[ f(k) = \binom{-r}{k-r} p^r (-q)^{k-r} \quad \text{for } k = r, r+1, r+2, \ldots, \tag{6.90} \]

with the definition of binomial coefficients extended for negative numbers on top as

\[ \binom{-r}{i} = \frac{(-r)(-r-1)\cdots(-r-i+1)}{i!} = (-1)^i \binom{r+i-1}{i} \tag{6.91} \]

for nonnegative integers $r$ and $i$, and the binomial theorem can also be extended⁸ for
negative exponents as

\[ (1+x)^{-r} = \sum_{i=0}^{\infty} \binom{-r}{i} x^i. \tag{6.92} \]

From Equations 6.90 and 6.92,

\[ \sum_{k=r}^{\infty} f(k) = p^r \sum_{i=0}^{\infty} \binom{-r}{i} (-q)^i = p^r (1-q)^{-r} = 1, \tag{6.93} \]

and so the probabilities $f(k)$ do, indeed, add up to 1 and are $p^r$ times the terms of a
series for a binomial expression with a negative exponent.

Clearly, the geometric distribution is a special case of the negative binomial, with
$r = 1$. Also, $X_r$ is the sum of $r$ i.i.d. geometric random variables $Z_1, Z_2, \ldots, Z_r$
with parameter $p$, because to get $r$ successes, we first have to wait $Z_1$ trials for the
first success, then $Z_2$ trials for the second success, independently of what happened
before, and so on. Thus, we can easily compute $E(X_r)$ and $\mathrm{Var}(X_r)$ as $r$ times $E(Z_i)$
and $\mathrm{Var}(Z_i)$, and so, by Example 5.1.11 and Exercise 5.3.4,

\[ E(X_r) = \frac{r}{p} \tag{6.94} \]

and

\[ \mathrm{Var}(X_r) = \frac{rq}{p^2}. \tag{6.95} \]

Similarly, from Example 5.3.2, the m.g.f. of $X_r$ is the $r$th power of the m.g.f. of $Z_i$:

\[ \psi_r(t) = \left( \frac{p e^t}{1 - q e^t} \right)^r. \tag{6.96} \]


8 Expand $(1 + x)^{-r}$ in a Taylor series.


Example 6.4.1 (Number of Children). A couple wants to have two boys. Find the
distribution of the number of children they must have to achieve this goal. Assume
that the children are boys or girls independently of each other and $P(\text{boy}) = 1/2$.

Clearly, the number of children is a negative binomial random variable with
parameters $p = 1/2$ and $r = 2$. Thus, with $f(k)$ denoting the probability of needing $k$
children, we have

\[ f(k) = (k-1) \left( \frac{1}{2} \right)^k \quad \text{for } k = 2, 3, 4, \ldots. \tag{6.97} \]

Furthermore, the expected number of children they need in order to have two
boys is $r/p = 4$.
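The probability function of Equation 6.89 and the numbers in Example 6.4.1 can be checked with a few lines of exact arithmetic. This is a minimal sketch; the function name `neg_binomial_pf` and the truncation point `K` are our own choices, not part of the text.

```python
from fractions import Fraction
from math import comb

def neg_binomial_pf(k, r, p):
    """P(X_r = k): probability that the r-th success occurs on trial k (Equation 6.89)."""
    if k < r:
        return Fraction(0)
    q = 1 - p
    return comb(k - 1, r - 1) * p**r * q**(k - r)

p = Fraction(1, 2)
r = 2

# Equation 6.97: f(k) = (k - 1) / 2**k in the two-boys example.
first_values = [neg_binomial_pf(k, r, p) for k in range(2, 7)]

# Truncated sums: they should be very close to 1 (total probability)
# and to r/p = 4 (expected value, Equation 6.94).
K = 200
total = float(sum(neg_binomial_pf(k, r, p) for k in range(r, K)))
mean = float(sum(k * neg_binomial_pf(k, r, p) for k in range(r, K)))
print(first_values, total, mean)
```

The sums are cut off at `K` trials, so they fall short of 1 and 4 by a negligible tail.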


The next type of random variable we want to consider is called a gamma random
variable because its density contains the so-called gamma function,

\[ \Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx \quad \text{for } t > 0. \tag{6.98} \]

This integral cannot be evaluated in terms of elementary functions, but only by
approximate methods, except for some specific values of $t$, which include the positive
integers.

Integration by parts yields the reduction formula

\[ \Gamma(t+1) = t\,\Gamma(t) \quad \text{for } t > 0, \tag{6.99} \]

and from here, using the straightforward evaluation $\Gamma(1) = 1$, we obtain

\[ \Gamma(r) = (r-1)! \quad \text{for } r = 1, 2, 3, \ldots. \tag{6.100} \]

Thus, the gamma function is a generalization of the factorial function from integer
arguments to positive real arguments.
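As a numerical illustration, Python's standard library provides the gamma function as `math.gamma`, so Equations 6.99 and 6.100 can be spot-checked directly. The crude midpoint-rule integrator `gamma_midpoint` below is our own helper (with an arbitrarily chosen upper limit and step count) and only approximates the integral in Equation 6.98.

```python
import math

# Reduction formula (Equation 6.99) at a few non-integer points.
reduction_ok = all(
    math.isclose(math.gamma(t + 1), t * math.gamma(t), rel_tol=1e-9)
    for t in (0.5, 1.7, 3.25)
)

# Factorial values (Equation 6.100).
factorial_ok = all(
    math.isclose(math.gamma(r), math.factorial(r - 1), rel_tol=1e-9)
    for r in range(1, 10)
)

def gamma_midpoint(t, upper=60.0, steps=100_000):
    """Crude midpoint-rule approximation of the integral in Equation 6.98."""
    h = upper / steps
    return h * sum(((i + 0.5) * h) ** (t - 1) * math.exp(-(i + 0.5) * h)
                   for i in range(steps))

approx = gamma_midpoint(2.5)
print(reduction_ok, factorial_ok, approx, math.gamma(2.5))
```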


Definition 6.4.2 (Gamma Random Variables). A continuous random variable with
density function

\[ f(x) = \begin{cases} 0 & \text{if } x \le 0 \\ \dfrac{\lambda^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x} & \text{if } x > 0 \end{cases} \tag{6.101} \]

is called a gamma random variable and $f(x)$ the gamma density with parameters $\alpha$
and $\lambda$, for any real $\alpha > 0$ and $\lambda > 0$.


The essential part of this definition is the fact that $f(x)$ is proportional to
$x^{\alpha-1} e^{-\lambda x}$ for $x > 0$, and the coefficient $\lambda^\alpha/\Gamma(\alpha)$ just normalizes this expression.
Indeed, with the change of variable $u = \lambda x$, we get

\[ \int_0^\infty \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x}\,dx = \frac{\lambda^\alpha}{\Gamma(\alpha)} \int_0^\infty \left( \frac{u}{\lambda} \right)^{\alpha-1} e^{-u}\,\frac{1}{\lambda}\,du = \frac{1}{\Gamma(\alpha)} \int_0^\infty u^{\alpha-1} e^{-u}\,du = 1. \tag{6.102} \]

Clearly, for $\alpha = 1$ the gamma density becomes the exponential density with
parameter $\lambda$. More generally, for $\alpha = n$ a positive integer, the gamma density turns
out to be the density of the sum of $n$ i.i.d. exponential r.v.’s with parameter $\lambda$, as
shown below. Hence, for $\alpha = n$, a gamma r.v. is a continuous analog of the negative
binomial: It is the waiting time for the occurrence of the $n$th arrival in a Poisson
process with parameter $\lambda$.


Theorem 6.4.1 (Gamma as the Sum of Exponentials). For any positive integer $n$,
let $T_1, T_2, \ldots, T_n$ be i.i.d. exponential random variables with density

\[ f(t) = \begin{cases} 0 & \text{if } t < 0 \\ \lambda e^{-\lambda t} & \text{if } t \ge 0. \end{cases} \tag{6.103} \]

Then $S_n = T_1 + T_2 + \cdots + T_n$ is a gamma random variable with parameters $\alpha = n$
and $\lambda$, that is, its density is

\[ f_n(t) = \begin{cases} 0 & \text{if } t < 0 \\ \dfrac{\lambda^n}{(n-1)!}\, t^{n-1} e^{-\lambda t} & \text{if } t \ge 0. \end{cases} \tag{6.104} \]

Proof. We use induction.
For $n = 1$, Equation 6.104 reduces to Equation 6.103, which is the density of
$S_1 = T_1$, and so the statement is true in this case.
Now, assume that Equation 6.104 is true for an arbitrary $n$. Then the convolution
formula (Equation 4.112) gives

\[ f_{n+1}(t) = \int_0^t f_n(x) f_1(t-x)\,dx = \int_0^t \frac{\lambda^n}{(n-1)!}\, x^{n-1} e^{-\lambda x}\, \lambda e^{-\lambda(t-x)}\,dx = \frac{\lambda^{n+1}}{(n-1)!}\, e^{-\lambda t} \int_0^t x^{n-1}\,dx = \frac{\lambda^{n+1}}{n!}\, t^n e^{-\lambda t} \quad \text{for } t > 0. \tag{6.105} \]

The expression on the right-hand side is the same as the one in Equation 6.104 with
$n + 1$ in place of $n$. Thus, if Equation 6.104 gives the density of $S_n$ for any $n$, then
it gives the density of $S_{n+1}$, with $n + 1$ in place of $n$, too, and thus gives the density
of $S_n$ for every $n$. □
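The induction step in Equation 6.105 can also be checked numerically: convolving the density of Equation 6.104 with the exponential density should reproduce Equation 6.104 with $n + 1$ in place of $n$. The sketch below uses a simple midpoint rule; the function names and the values of $\lambda$ and $t$ are illustrative assumptions only.

```python
import math

def gamma_density_int(t, n, lam):
    """Equation 6.104: density of the sum of n i.i.d. exponential(lam) variables."""
    if t < 0:
        return 0.0
    return lam**n / math.factorial(n - 1) * t ** (n - 1) * math.exp(-lam * t)

def convolve_with_exponential(t, n, lam, steps=20_000):
    """Midpoint-rule approximation of the convolution integral in Equation 6.105."""
    h = t / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        # f_n(x) * f_1(t - x); for n = 1 Equation 6.104 is the exponential density
        total += gamma_density_int(x, n, lam) * gamma_density_int(t - x, 1, lam)
    return total * h

lam, t = 1.5, 2.0  # illustrative values, not from the text
checks = [(convolve_with_exponential(t, n, lam), gamma_density_int(t, n + 1, lam))
          for n in (1, 2, 5)]
for lhs, rhs in checks:
    print(lhs, rhs)
```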


The preceding theorem implies that the sum of two independent gamma random
variables, with integer $\alpha$ values, say $m$ and $n$, and a common $\lambda$, is gamma with
$\alpha = m + n$ and the same $\lambda$, because it is the sum of $m + n$ i.i.d. exponential random
variables with parameter $\lambda$. The sum is still gamma even if the parameters are not
integers.

Theorem 6.4.2 (Sum of Independent Gamma Variables). For any $r, s > 0$, let $R$
and $S$ be two independent gamma random variables with parameters $\alpha = r$ and
$\alpha = s$, respectively, and a common $\lambda$. Then $T = R + S$ is gamma with parameters
$r + s$ and $\lambda$.


Proof. By the convolution formula (Equation 4.112), for $t > 0$,

\[ f_T(t) = \int_0^t f_R(x) f_S(t-x)\,dx = \int_0^t \frac{\lambda^r}{\Gamma(r)}\, x^{r-1} e^{-\lambda x} \cdot \frac{\lambda^s}{\Gamma(s)}\, (t-x)^{s-1} e^{-\lambda(t-x)}\,dx = \frac{\lambda^{r+s}}{\Gamma(r)\Gamma(s)}\, e^{-\lambda t} \int_0^t x^{r-1} (t-x)^{s-1}\,dx. \tag{6.106} \]

In the last integral we change the variable $x$ to $u$ by substituting $x = tu$ and $dx = t\,du$.
Then we get

\[ f_T(t) = \frac{\lambda^{r+s}}{\Gamma(r)\Gamma(s)}\, e^{-\lambda t} \int_0^1 (tu)^{r-1} (t - tu)^{s-1}\, t\,du = \frac{\lambda^{r+s}}{\Gamma(r+s)}\, t^{r+s-1} e^{-\lambda t} \int_0^1 \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}\, u^{r-1} (1-u)^{s-1}\,du. \tag{6.107} \]

Here the function $[\lambda^{r+s}/\Gamma(r+s)]\, t^{r+s-1} e^{-\lambda t}$ is the gamma density with parameters
$r + s$ and $\lambda$. Since the whole expression on the right-hand side is a density as well,
we must have

\[ \int_0^1 \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}\, u^{r-1} (1-u)^{s-1}\,du = 1 \tag{6.108} \]

and

\[ f_T(t) = \frac{\lambda^{r+s}}{\Gamma(r+s)}\, t^{r+s-1} e^{-\lambda t}. \tag{6.109} \]

□


We have an important by-product of the above proof:


Corollary 6.4.1.

\[ \int_0^1 u^{r-1} (1-u)^{s-1}\,du = \frac{\Gamma(r)\Gamma(s)}{\Gamma(r+s)}. \tag{6.110} \]

This integral is called the beta integral and its value

\[ B(r, s) = \frac{\Gamma(r)\Gamma(s)}{\Gamma(r+s)} \tag{6.111} \]

is the beta function of $r$ and $s$. It will show up again shortly in the density of beta
random variables.

Example 6.4.2 (Expectation and Variance of Gamma Variables). For the $T_i$ in
Theorem 6.4.1, $E(T_i) = 1/\lambda$ and $\mathrm{Var}(T_i) = 1/\lambda^2$, so if $T$ is gamma with parameters
$\alpha = n$ and arbitrary $\lambda$, then $E(T) = n/\lambda$ and $\mathrm{Var}(T) = n/\lambda^2$. These expressions
remain valid for arbitrary $\alpha$ in place of $n$. For a gamma random variable $T$ with
arbitrary $\alpha$ and $\lambda$, we have, with the change of variable $u = \lambda x$,

\[ E(T) = \frac{\lambda^\alpha}{\Gamma(\alpha)} \int_0^\infty x \cdot x^{\alpha-1} e^{-\lambda x}\,dx = \frac{\lambda^\alpha}{\Gamma(\alpha)} \int_0^\infty \left( \frac{u}{\lambda} \right)^{\alpha} e^{-u}\,\frac{1}{\lambda}\,du = \frac{1}{\lambda} \cdot \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)} = \frac{\alpha}{\lambda}. \tag{6.112} \]

Similarly,

\[ \mathrm{Var}(T) = \frac{\alpha}{\lambda^2}. \tag{6.113} \]

(The proof is left as an exercise.)
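For a non-integer $\alpha$ the formulas $E(T) = \alpha/\lambda$ and $\mathrm{Var}(T) = \alpha/\lambda^2$ can be compared with a direct numerical integration of the gamma density. The following is a rough sketch: the helper names, the integration limit and the parameter values are our own assumptions, and the midpoint rule is only an approximation.

```python
import math

def gamma_density(x, alpha, lam):
    """Gamma density of Definition 6.4.2."""
    if x <= 0:
        return 0.0
    return lam**alpha / math.gamma(alpha) * x ** (alpha - 1) * math.exp(-lam * x)

def gamma_moment(k, alpha, lam, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of E(T**k) for a gamma random variable T."""
    h = upper / steps
    return h * sum(((i + 0.5) * h) ** k * gamma_density((i + 0.5) * h, alpha, lam)
                   for i in range(steps))

alpha, lam = 2.5, 1.5  # illustrative values with a non-integer alpha
mean = gamma_moment(1, alpha, lam)
variance = gamma_moment(2, alpha, lam) - mean**2
print(mean, alpha / lam)          # both should be close to 1.6667
print(variance, alpha / lam**2)   # both should be close to 1.1111
```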


Another important case in which we obtain a gamma random variable is de- 
scribed in the following theorem. 


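Theorem 6.4.3 (Square of a Normal Random Variable). Let $X$ be an $N(0, \sigma^2)$
random variable. Then $Y = X^2$ is gamma with $\alpha = 1/2$ and $\lambda = 1/(2\sigma^2)$.


Proof. By Equation 4.49,

\[ f_Y(y) = \begin{cases} \dfrac{1}{2\sqrt{y}} \left[ f_X(\sqrt{y}) + f_X(-\sqrt{y}) \right] & \text{if } y > 0 \\ 0 & \text{if } y \le 0. \end{cases} \tag{6.114} \]

In the present case,

\[ f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/(2\sigma^2)} \quad \text{for } -\infty < x < \infty, \tag{6.115} \]

and so

\[ f_Y(y) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma}\, y^{-1/2} e^{-y/(2\sigma^2)} & \text{if } y > 0 \\ 0 & \text{if } y \le 0. \end{cases} \tag{6.116} \]

For $y > 0$, this density is proportional to $y^{-1/2} e^{-y/(2\sigma^2)}$ and is therefore gamma with
$\alpha = 1/2$ and $\lambda = 1/(2\sigma^2)$. □


Since the coefficient in Definition 6.4.2 with $\alpha = 1/2$ and $\lambda = 1/(2\sigma^2)$ and the
coefficient in Equation 6.116 normalize the same function, they must be equal, that
is,

\[ \frac{(2\sigma^2)^{-1/2}}{\Gamma(1/2)} = \frac{1}{\sigma\sqrt{2\pi}} \tag{6.117} \]

must hold. Hence, we must have

\[ \Gamma\left( \frac{1}{2} \right) = \sqrt{\pi}. \tag{6.118} \]

Using Equation 6.118 and the reduction formula, Equation 6.99, we obtain
(Exercise 6.4.11) the values of the gamma function for positive half-integer arguments:

\[ \Gamma\left( k + \frac{1}{2} \right) = \frac{\sqrt{\pi}\,(2k)!}{2^{2k}\,k!} \quad \text{for } k = 0, 1, 2, \ldots. \tag{6.119} \]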
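A quick numerical comparison of the closed form in Equation 6.119 with the library function `math.gamma` is sketched below; the function name `gamma_half_integer` is our own.

```python
import math

def gamma_half_integer(k):
    """Closed form of Gamma(k + 1/2) from Equation 6.119."""
    return math.sqrt(math.pi) * math.factorial(2 * k) / (2 ** (2 * k) * math.factorial(k))

pairs = [(gamma_half_integer(k), math.gamma(k + 0.5)) for k in range(6)]
for closed_form, library_value in pairs:
    print(closed_form, library_value)
```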


In various statistical applications, we encounter the sum of the squares of inde- 
pendent standard normal random variables. As a consequence of Theorems 6.4.3 and 
6.4.2, such sums have gamma distributions, but we have a special name associated 
with them and their densities are specially tabulated. Thus, we have the following 
definition and theorem: 


Definition 6.4.3 (Chi-Square Random Variables). For independent standard normal
random variables $Z_1, Z_2, \ldots, Z_n$, the sum $\chi_n^2 = Z_1^2 + Z_2^2 + \cdots + Z_n^2$ is called
a chi-square random variable with $n$ degrees of freedom.


Theorem 6.4.4 (Chi-Square is Gamma). The distribution of $\chi_n^2$ is gamma with
parameters $\alpha = n/2$ and $\lambda = 1/2$, and its density is

\[ f_{\chi_n^2}(x) = \begin{cases} 0 & \text{if } x \le 0 \\ \dfrac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2-1} e^{-x/2} & \text{if } x > 0. \end{cases} \tag{6.120} \]

Proof. By Theorem 6.4.3, $Z_i^2$ is gamma with parameters $\alpha = 1/2$ and $\lambda = 1/2$, for
all $i$, and so, by repeated application of the result of Theorem 6.4.2, the statement
follows. □


Corollary 6.4.2 (Expectation and Variance of Chi-Square). $E(\chi_n^2) = n$ and
$\mathrm{Var}(\chi_n^2) = 2n$.


Proof. These values follow at once from Example 6.4.2 and Equation 6.120. □

Corollary 6.4.3 (Density of $\chi_n$). The density of $\chi_n = \sqrt{\chi_n^2}$ is

\[ f_{\chi_n}(x) = \begin{cases} 0 & \text{if } x \le 0 \\ \dfrac{2}{2^{n/2}\,\Gamma(n/2)}\, x^{n-1} e^{-x^2/2} & \text{if } x > 0. \end{cases} \tag{6.121} \]


We leave the proof as Exercise 6.4.10. 
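Corollary 6.4.2 lends itself to a small simulation: we draw many sums of $n$ squared standard normal values and compare their sample mean and variance with $n$ and $2n$. This is only an illustrative sketch; the seed, the sample size and the choice $n = 5$ are arbitrary assumptions of ours, and a simulation can only agree approximately.

```python
import random

def chi_square_sample(n, rng):
    """One draw of Z_1**2 + ... + Z_n**2 for independent standard normal Z_i."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

rng = random.Random(12345)  # fixed seed, chosen arbitrarily for reproducibility
n = 5
draws = [chi_square_sample(n, rng) for _ in range(100_000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / len(draws)
print(sample_mean, sample_var)  # should be near n = 5 and 2n = 10
```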


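Example 6.4.3 (Moment Generating Function of Chi-Square). For independent standard
normal random variables $Z_1, Z_2, \ldots, Z_n$, the m.g.f. of $\chi_n^2 = Z_1^2 + Z_2^2 + \cdots + Z_n^2$
is given by

\[ \psi_n(t) = E\left( e^{t \sum_{i=1}^n Z_i^2} \right) = \prod_{i=1}^n E\left( e^{t Z_i^2} \right). \tag{6.122} \]

Here

\[ E\left( e^{t Z_i^2} \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{t z^2} e^{-z^2/2}\,dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(1-2t) z^2/2}\,dz \quad \text{for } t < \frac{1}{2}. \tag{6.123} \]

Making the change of variable $u = z\sqrt{1 - 2t}$, we get

\[ E\left( e^{t Z_i^2} \right) = \frac{1}{\sqrt{2\pi}\,\sqrt{1-2t}} \int_{-\infty}^{\infty} e^{-u^2/2}\,du = \frac{1}{\sqrt{1-2t}} \quad \text{for } t < \frac{1}{2}. \tag{6.124} \]

Thus,

\[ \psi_n(t) = \prod_{i=1}^n E\left( e^{t Z_i^2} \right) = \left( \frac{1}{\sqrt{1-2t}} \right)^n = (1-2t)^{-n/2} \quad \text{for } t < \frac{1}{2}. \tag{6.125} \]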


Gamma random variables, with values of the parameter $\alpha$ not just integers or
half-integers, are often used to model continuous random variables with an unknown or
approximately known distribution on $(0, \infty)$. Similarly, continuous random variables
with unknown distribution on $[0, 1]$ are often modelled by beta random variables, to
be defined below. This is especially true in some statistical applications of Bayes’
theorem, in which the prior probability $P$ of an event is taken to be a random variable
with such a distribution on $[0, 1]$, and then the posterior distribution turns out to be
beta, too. (See Example 6.4.4 below.)


Definition 6.4.4 (Beta Random Variables). A continuous random variable with
density function⁹

\[ f(x) = \begin{cases} \dfrac{1}{B(r, s)}\, x^{r-1} (1-x)^{s-1} & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \tag{6.126} \]

is called a beta random variable and $f(x)$ the beta density with parameters $r$ and $s$,
for any real $r > 0$ and $s > 0$. Here

\[ B(r, s) = \frac{\Gamma(r)\Gamma(s)}{\Gamma(r+s)}. \tag{6.127} \]

Notice that the beta distribution with $r = s = 1$ is the uniform distribution on
$[0, 1]$.


Example 6.4.4 (Updating Unknown Probabilities by Bayes’ Theorem). Suppose the
probability $P$ of an event $A$ is unknown and is taken to be a uniform random variable
on $[0, 1]$, which is a (somewhat controversial) way of expressing the fact that we
have no idea what the value of $P$ is. Assume that we conduct $n \ge 1$ independent
performances of the same experiment, and obtain $k$ successes, that is, obtain $A$ exactly
$k$ times. How should we revise the distribution of $P$ in light of this result?


9 We assume $0^0 = 1$ where necessary.


We have already treated this problem for $n = 1$ in Example 4.6.3. The computation
for general $n \ge 1$ is similar; we just use a binomial distribution for the number
$X$ of successes instead of the Bernoulli distribution used there. Thus,

\[ f_{X|P}(k, p) = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{for } k = 0, 1, \ldots, n \tag{6.128} \]

and

\[ f_P(p) = \begin{cases} 1 & \text{for } p \in [0, 1] \\ 0 & \text{otherwise.} \end{cases} \tag{6.129} \]

By Equation 4.142 (the $\binom{n}{k}$ in the numerator and denominator cancel),

\[ f_{P|X}(p, k) = \begin{cases} \dfrac{p^k (1-p)^{n-k}}{\int_0^1 p^k (1-p)^{n-k}\,dp} & \text{for } p \in [0, 1] \text{ and } k = 0, 1, \ldots, n \\ 0 & \text{otherwise.} \end{cases} \tag{6.130} \]

Thus the posterior density of $P$ is beta with parameters $r = k + 1$ and $s = n - k + 1$.
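The update rule of this example is easy to state in code. The sketch below assumes a uniform prior, as in the example; the function names and the data ($n = 10$, $k = 7$) are hypothetical. For integer parameters the normalizing integral in Equation 6.130 is obtained from Equations 6.100 and 6.111, and the posterior mean uses the beta mean $r/(r+s)$ derived as Equation 6.131 in the next example.

```python
from fractions import Fraction
from math import factorial

def posterior_beta_parameters(n, k):
    """Posterior parameters (r, s) after k successes in n trials with a uniform prior."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    return k + 1, n - k + 1

def beta_integer(r, s):
    """B(r, s) for positive integers, via Gamma(m) = (m - 1)! (Equations 6.100 and 6.111)."""
    return Fraction(factorial(r - 1) * factorial(s - 1), factorial(r + s - 1))

n, k = 10, 7  # illustrative data, not from the text
r, s = posterior_beta_parameters(n, k)
normalizer = beta_integer(r, s)       # the integral in the denominator of Equation 6.130
posterior_mean = Fraction(r, r + s)   # beta mean r/(r + s), that is, (k + 1)/(n + 2)
print(r, s, normalizer, posterior_mean)
```

Note that the posterior mean $(k+1)/(n+2)$ differs slightly from the relative frequency $k/n$.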


Example 6.4.5 (Expectation and Variance of Beta Variables). The expected value is
very easy to compute, because the relevant integral produces another beta function.
Thus, if $X$ is beta with parameters $r$ and $s$, then

\[ E(X) = \frac{1}{B(r, s)} \int_0^1 x \cdot x^{r-1} (1-x)^{s-1}\,dx = \frac{B(r+1, s)}{B(r, s)} = \frac{\Gamma(r+1)\Gamma(s)}{\Gamma(r+s+1)} \cdot \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)} = \frac{r}{r+s}. \tag{6.131} \]

Similarly,

\[ \mathrm{Var}(X) = \frac{rs}{(r+s)^2 (r+s+1)}. \tag{6.132} \]

We leave the proof of this formula as Exercise 6.4.18.
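A numerical spot check of Equations 6.110, 6.131 and 6.132 for non-integer parameters is sketched below. The helper names are our own; the moment formula $E(X^k) = B(r+k, s)/B(r, s)$ follows from the same argument as Equation 6.131, and the midpoint rule is only approximate.

```python
import math

def beta_function(r, s):
    """Equation 6.111: B(r, s) expressed through the gamma function."""
    return math.gamma(r) * math.gamma(s) / math.gamma(r + s)

def beta_integral(r, s, steps=100_000):
    """Midpoint-rule approximation of the beta integral in Equation 6.110."""
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** (r - 1) * (1.0 - (i + 0.5) * h) ** (s - 1)
                   for i in range(steps))

def beta_moment(k, r, s):
    """E(X**k) = B(r + k, s) / B(r, s), by the same argument as Equation 6.131."""
    return beta_function(r + k, s) / beta_function(r, s)

r, s = 2.5, 1.5  # illustrative non-integer parameters
mean = beta_moment(1, r, s)
variance = beta_moment(2, r, s) - mean**2
print(beta_integral(r, s), beta_function(r, s))
print(mean, r / (r + s))
print(variance, r * s / ((r + s) ** 2 * (r + s + 1)))
```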


Exercises 


Exercise 6.4.1. Find the probability of obtaining, in i.i.d., parameter $p$ Bernoulli
trials, $r$ successes before $s$ failures.

Exercise 6.4.2. Let $N_r$ be the number of i.i.d., parameter $p$ Bernoulli trials needed
to produce either $r$ successes or $r$ failures, whichever occurs first. Find the p.f. of $N_r$.


Exercise 6.4.3. Let $Y_r$ be the number of failures in i.i.d., parameter $p$ Bernoulli trials
before the $r$th success. Find the p.f. of $Y_r$.


Exercise 6.4.4. 


1. A die is rolled until 6 shows up for the second time. What is the probability that 
no more than eight rolls are needed? 

2. How many rolls are needed to make the probability of getting the second 6 on or 
before the last roll exceed 1/2? 


Exercise 6.4.5. For any positive integers $r$ and $s$, let $X_r$ be the number of i.i.d.
Bernoulli trials with parameter $p$ up to and including the $r$th success and $X_{r+s}$ their
number up to and including the $(r + s)$th success. Find the joint p.f. of $X_r$ and $X_{r+s}$.


Exercise 6.4.6. Let $S_m$ denote the number of successes in the first $m$ of $m + n$ i.i.d.
parameter $p$ Bernoulli trials for any positive integers $m$ and $n$. Find the p.f. of $S_m$
under the condition that the $r$th success, for any positive $r < m + n$, occurs on the
$(m + n)$th trial.


Exercise 6.4.7. Show that, for $\alpha > 1$, the mode of the gamma density (that is, the
$x$-value where $f(x)$ takes on its maximum) is $(\alpha - 1)/\lambda$.


Exercise 6.4.8. Sketch the gamma density for the following $(\alpha, \lambda)$ pairs: (1, 1),
(1, 2), (2, 1), (1, 1/2), (1/2, 1), (1/2, 2), (4, 4).


Exercise 6.4.9. For a gamma random variable $T$ with arbitrary $\alpha$ and $\lambda$, prove that


1. $E(T^k) = [\alpha(\alpha+1)\cdots(\alpha+k-1)]/\lambda^k$ for any positive integer $k$.
2. $\mathrm{Var}(T) = \alpha/\lambda^2$.
3. The m.g.f. of $T$ is $\psi(t) = [\lambda/(\lambda - t)]^\alpha$ for $t < \lambda$.


Exercise 6.4.10. Prove Corollary 6.4.3.

Exercise 6.4.11. Prove Equation 6.119.


Exercise 6.4.12. Choose a point at random in the plane, with its coordinates X and 
Y independent standard normal random variables. What is the probability that the 
point is inside the unit circle? 


Exercise 6.4.13. Let $X$ and $Y$ be i.i.d. $N(0, \sigma^2)$ normal random variables. Find the
density of $U = X^2 + Y^2$.


Exercise 6.4.14. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(0, \sigma^2)$ normal random variables.
Find the density of $V = \sum_{i=1}^n X_i^2$.


Exercise 6.4.15. Let $X_1, X_2, \ldots, X_n$ be i.i.d. uniform random variables on $[0, 1]$.
Show that $Y = \max(X_1, X_2, \ldots, X_n)$ and $Z = \min(X_1, X_2, \ldots, X_n)$ are beta and
find their parameters $r$ and $s$.

Exercise 6.4.16. Show that the mode of the distribution given by Equation 6.130
(that is, the value of $p$ where $f_{P|X}(p, k)$ takes on its maximum) is the relative
frequency $k/n$. (You need to treat the cases $k = 0$ and $k = n$ separately from the
others!)


Exercise 6.4.17. Sketch the beta density for the following $(r, s)$ pairs: (1, 2), (2, 1),
(2, 2), (1, 3), (1/2, 1), (11, 21).


Exercise 6.4.18. For a beta random variable $X$ with arbitrary $r$ and $s$, prove that


1. $E(X^k) = [r(r+1)\cdots(r+k-1)]/[(r+s)(r+s+1)\cdots(r+s+k-1)]$ for
any positive integer $k$.
2. $\mathrm{Var}(X) = rs/[(r+s)^2(r+s+1)]$.


Exercise 6.4.19. Modify Example 6.4.4 by taking the prior distribution to be beta
with arbitrary, known values $r$ and $s$. Find the posterior distribution of the probability
$P$ of the event $A$ if in $n \ge 1$ independent performances of the same experiment we
obtain $k \le n$ successes.


6.5 Multivariate Normal Random Variables 


In many applications, we have to deal simultaneously with two or more normal ran- 
dom variables whose joint distribution is a direct generalization of the normal dis- 
tribution. For example, the height and weight of a randomly selected person is such 
a pair, and so are the test scores of a student on two exams in a math course, and 
the heights of a randomly selected father-son pair. Also, in statistical samples from a 
normally distributed population the joint observations follow a multivariate normal 
distribution. 

We take a somewhat indirect, but mathematically convenient, approach to defin- 
ing bivariate normal random variables. We start with two independent standard nor- 
mal random variables and transform them linearly. 


Definition 6.5.1 (Bivariate Normal Random Variables). Let $Z_1$ and $Z_2$ be independent
standard normal random variables and $a_{11}, a_{12}, a_{21}, a_{22}, b_1$ and $b_2$ any constants
satisfying $a_{11}^2 + a_{12}^2 \ne 0$, $a_{21}^2 + a_{22}^2 \ne 0$ and $a_{11}a_{22} - a_{12}a_{21} \ne 0$. Then

\[ X_1 = a_{11} Z_1 + a_{12} Z_2 + b_1 \]

and

\[ X_2 = a_{21} Z_1 + a_{22} Z_2 + b_2 \tag{6.133} \]

are said to form a bivariate normal pair.


By Theorems 6.2.4 and 6.2.6, the marginals $X_1$ and $X_2$ are (univariate) normal
with means $\mu_1 = b_1$ and $\mu_2 = b_2$ and variances $\sigma_1^2 = a_{11}^2 + a_{12}^2$ and $\sigma_2^2 = a_{21}^2 + a_{22}^2$,
respectively. Furthermore, $\sigma_{1,2} = \mathrm{Cov}(X_1, X_2) = a_{11}a_{21} + a_{12}a_{22}$, by the definition
of $Z_1$ and $Z_2$ as independent standard normal random variables, and the correlation
coefficient of $X_1$ and $X_2$ is $\rho = (a_{11}a_{21} + a_{12}a_{22})/(\sigma_1\sigma_2)$. Note that $\rho \ne \pm 1$ by
Corollary 5.4.1 and the requirement that $a_{11}a_{22} - a_{12}a_{21} \ne 0$.
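The parameters of a bivariate normal pair can be read off from the coefficients of Definition 6.5.1 as just described. The following sketch packages these formulas in a function; the name `bivariate_parameters`, the dictionary keys and the sample coefficients are our own illustrative choices.

```python
import math

def bivariate_parameters(a11, a12, a21, a22, b1, b2):
    """Means, standard deviations and correlation of the pair in Equation 6.133."""
    if a11 * a22 - a12 * a21 == 0:
        raise ValueError("a11*a22 - a12*a21 must be nonzero")
    sigma1 = math.hypot(a11, a12)      # sqrt(a11^2 + a12^2)
    sigma2 = math.hypot(a21, a22)      # sqrt(a21^2 + a22^2)
    cov = a11 * a21 + a12 * a22        # Cov(X1, X2)
    return {"mu1": b1, "mu2": b2, "sigma1": sigma1, "sigma2": sigma2,
            "rho": cov / (sigma1 * sigma2)}

params = bivariate_parameters(3.0, 4.0, 1.0, 0.0, 10.0, -2.0)  # illustrative coefficients
print(params)
```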

In Theorem 5.4.2 we saw that for independent random variables X and Y whose 
expectations exist, Cov(X, Y) = 0. One of the most important properties of bivariate 
normal random variables is that the converse of this fact holds for them: 


Theorem 6.5.1 (For Bivariate Normal Random Variables, Zero Covariance Implies
Independence). If $X_1$ and $X_2$ are bivariate normal, then $\mathrm{Cov}(X_1, X_2) = 0$
implies their independence.


Proof. We are going to use the bivariate moment generating function

\[ \psi(s, t) = E\left( e^{sX_1 + tX_2} \right) \tag{6.134} \]

to prove this theorem.

By Theorem 6.2.6, $Y = sX_1 + tX_2$ is normal, because it is a linear combination
of the original, independent, random variables $Z_1$ and $Z_2$. Clearly, it has mean
$\mu_Y = s\mu_1 + t\mu_2$ and variance $\sigma_Y^2 = s^2\sigma_1^2 + t^2\sigma_2^2 + 2st\sigma_{1,2}$. (Here we wrote $\sigma_{1,2}$ for
$\mathrm{Cov}(X_1, X_2)$.) Thus, by the definition of the m.g.f. of $Y$ as $\psi_Y(t) = E(e^{tY})$ and by
Equation 6.46,

\[ \psi(s, t) = \psi_Y(1) = e^{s\mu_1 + t\mu_2 + (s^2\sigma_1^2 + t^2\sigma_2^2 + 2st\sigma_{1,2})/2}. \tag{6.135} \]

Hence, if $\sigma_{1,2} = 0$, then $\psi(s, t)$ factors as

\[ \psi(s, t) = e^{s\mu_1 + s^2\sigma_1^2/2}\, e^{t\mu_2 + t^2\sigma_2^2/2}, \tag{6.136} \]

which is the product of the moment generating functions of $X_1$ and $X_2$.
Now, if $X_1$ and $X_2$ are independent, then, clearly,

\[ \psi(s, t) = E\left( e^{sX_1 + tX_2} \right) = E\left( e^{sX_1} e^{tX_2} \right) = E\left( e^{sX_1} \right) E\left( e^{tX_2} \right) = e^{s\mu_1 + s^2\sigma_1^2/2}\, e^{t\mu_2 + t^2\sigma_2^2/2}, \tag{6.137} \]

the same function as the one we obtained above from the assumption $\sigma_{1,2} = 0$. By
the uniqueness of moment generating functions, which holds in the two-dimensional
case as well, $X_1$ and $X_2$ must therefore be independent if $\sigma_{1,2} = 0$. □


Note that this theorem does not say that if $X_1$ and $X_2$ are only separately, rather
than jointly, normal, then $\mathrm{Cov}(X_1, X_2) = 0$ implies their independence. That statement
is not true in general, as Exercise 6.5.4 shows.

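Define two new standard normal random variables as

\[ Y_1 = \frac{X_1 - \mu_1}{\sigma_1} \]

and

\[ Y_2 = \frac{1}{\sqrt{1-\rho^2}} \left( \frac{X_2 - \mu_2}{\sigma_2} - \rho\, \frac{X_1 - \mu_1}{\sigma_1} \right). \tag{6.138} \]

One can check by some straightforward calculations (Exercise 6.5.1) that $Y_1$ and
$Y_2$ are indeed standard normal and $\mathrm{Cov}(Y_1, Y_2) = 0$. Thus, by Theorem 6.5.1, $Y_1$ and
$Y_2$ are independent. Furthermore, we can write $X_1$ and $X_2$ in terms of $Y_1$ and $Y_2$ as

\[ X_1 = \sigma_1 Y_1 + \mu_1, \qquad X_2 = \sigma_2 \left( \rho Y_1 + \sqrt{1-\rho^2}\, Y_2 \right) + \mu_2. \tag{6.139} \]

From here we can easily obtain the conditional expectation, variance and density
of $X_2$ given $X_1 = x_1$, because if $X_1 = x_1$, then $Y_1 = (x_1 - \mu_1)/\sigma_1$ and

\[ X_2 = \sigma_2 \left( \rho\, \frac{x_1 - \mu_1}{\sigma_1} + \sqrt{1-\rho^2}\, Y_2 \right) + \mu_2. \tag{6.140} \]

Thus, since $Y_2$ is independent of $Y_1$, it is unaffected by the condition $X_1 = x_1$, and
so, under this condition, $X_2$ is normal with mean

\[ E(X_2 | X_1 = x_1) = \mu_2 + \rho\sigma_2\, \frac{x_1 - \mu_1}{\sigma_1} \tag{6.141} \]

and variance

\[ \mathrm{Var}(X_2 | X_1 = x_1) = (1 - \rho^2)\sigma_2^2. \tag{6.142} \]

Hence

\[ f_{X_2|X_1}(x_2, x_1) = \frac{1}{\sqrt{2\pi(1-\rho^2)}\,\sigma_2} \exp\left( -\frac{\left( x_2 - \mu_2 - \rho\sigma_2 \frac{x_1 - \mu_1}{\sigma_1} \right)^2}{2(1-\rho^2)\sigma_2^2} \right). \tag{6.143} \]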


Observe that $E(X_2 | X_1 = x_1)$ is a linear function of $x_1$. The graph of this function
in the $x_1x_2$-plane is called the regression line of $X_2$ on $X_1$, and its equation can also
be written in the form

\[ \frac{x_2 - \mu_2}{\sigma_2} = \rho\, \frac{x_1 - \mu_1}{\sigma_1} \tag{6.144} \]

or as

\[ z_2 = \rho z_1, \tag{6.145} \]

where we write $z_1$ and $z_2$ as the standardizations of $x_1$ and $x_2$.

Thus, the regression line goes through “the point of averages” $(\mu_1, \mu_2)$ and has,
in standard units, slope $\rho$.

Note, furthermore, that the conditional variance $\mathrm{Var}(X_2 | X_1 = x_1)$ is the same for
every value of $x_1$. Statisticians call this property of the bivariate normal distribution
homoscedasticity (Greek for “same scatter”). Other bivariate distributions generally
do not have this property; such distributions, for which $\mathrm{Var}(X_2 | X_1 = x_1)$ is a
nonconstant function of $x_1$, are called heteroscedastic.

Multiplying the conditional density in Equation 6.143 by the marginal density

\[ f_{X_1}(x_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left( -\frac{(x_1 - \mu_1)^2}{2\sigma_1^2} \right), \tag{6.146} \]

we obtain the joint density of $X_1$ and $X_2$. Thus, we have proved the following theorem:

Theorem 6.5.2 (Bivariate Normal Density). If $X_1$ and $X_2$ form a bivariate normal
pair with means $\mu_1$, $\mu_2$, variances $\sigma_1^2$, $\sigma_2^2$ and correlation coefficient $\rho \ne \pm 1$, then
their joint density is given by

\[ f(x_1, x_2) = \frac{1}{2\pi\sqrt{1-\rho^2}\,\sigma_1\sigma_2} \exp\left( \frac{-1}{2(1-\rho^2)} \left[ \left( \frac{x_1 - \mu_1}{\sigma_1} \right)^2 - 2\rho \left( \frac{x_1 - \mu_1}{\sigma_1} \right) \left( \frac{x_2 - \mu_2}{\sigma_2} \right) + \left( \frac{x_2 - \mu_2}{\sigma_2} \right)^2 \right] \right). \tag{6.147} \]

Clearly, if $X_1$ and $X_2$ have a joint density like this one, then we can write them
as in Equations 6.139 in terms of standard normal $Y_1$ and $Y_2$, which shows that $X_1$
and $X_2$ are a bivariate normal pair. In fact, many books define bivariate normal pairs
as random variables that have a joint density of this form.

Notice the symmetry of $f(x_1, x_2)$ with respect to interchanging the subscripts 1
and 2. Consequently, the conditional expectation, variance and density of $X_1$ given
$X_2 = x_2$ can be obtained from the previous conditional expressions simply by
interchanging the subscripts 1 and 2.


Example 6.5.1 (Two Exams). The scores on two successive exams taken by the students
of the same large class usually approximate a bivariate normal distribution.
Assume that $X_1$ and $X_2$, the scores of a randomly selected student on two exams,
are bivariate normal with $\mu_1 = \mu_2 = 70$, $\sigma_1 = \sigma_2 = 12$ and $\rho = 0.70$. Suppose a
student scored 90 on the first exam. What is his expected score on the second exam
and what is the probability that he will score 90 or more on the second exam?

The conditional expected score on the second exam is given by Equation 6.141:

\[ E(X_2 | X_1 = 90) = 70 + 0.70 \cdot 12 \cdot \frac{90 - 70}{12} = 84. \tag{6.148} \]

Notice that the high score of 90 on the first exam gives a mean prediction of
only 84 on the second exam. This phenomenon is universal for bivariate normal
variables: given an “extreme” value of one of the variables (as measured in standard
units), the expected value of the other variable will be less extreme. Hence the name
“regression.”

The conditional variance of $X_2$ is given by Equation 6.142:

\[ \mathrm{Var}(X_2 | X_1 = 90) = (1 - 0.70^2) \cdot 12^2 = 73.44. \tag{6.149} \]

Thus, under the condition $X_1 = 90$, $X_2$ is normal with $\mu = 84$ and
$\sigma = \sqrt{73.44} \approx 8.57$. Therefore

\[ P(X_2 \ge 90 | X_1 = 90) = 1 - \Phi\left( \frac{90 - 84}{8.57} \right) \approx 0.242. \tag{6.150} \]
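The arithmetic of Example 6.5.1 can be reproduced as follows. This is a minimal sketch: the function names are our own, and $\Phi$ is evaluated through the error function `math.erf` of the standard library.

```python
import math

def standard_normal_cdf(z):
    """Phi(z), written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conditional_normal(x1, mu1, mu2, sigma1, sigma2, rho):
    """Mean and variance of X2 given X1 = x1 (Equations 6.141 and 6.142)."""
    mean = mu2 + rho * sigma2 * (x1 - mu1) / sigma1
    variance = (1.0 - rho**2) * sigma2**2
    return mean, variance

# Data of Example 6.5.1: mu1 = mu2 = 70, sigma1 = sigma2 = 12, rho = 0.70.
mean, variance = conditional_normal(90, 70, 70, 12, 12, 0.70)
prob = 1.0 - standard_normal_cdf((90 - mean) / math.sqrt(variance))
print(mean, variance, prob)  # roughly 84, 73.44 and 0.242
```

The conditional variance does not depend on the observed score, which is the homoscedasticity noted earlier.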
Example 6.5.2 (Heights of Husbands and Wives). In a statistical study, the heights
of husbands and their wives were measured and found to have a bivariate normal
distribution. Let $X_1$ denote the height of a randomly selected husband and $X_2$ the
height of his wife, with $\mu_1 = 68''$, $\mu_2 = 64''$, $\sigma_1 = 4''$, $\sigma_2 = 3.6''$, $\rho = 0.25$. (The
slight positive correlation can be attributed to the fact that, to some extent, taller
people tend to marry taller ones, and shorter people shorter ones.)
Given these data,


(a) what is the expected height of a man whose wife is 61″ tall;

(b) what is the probability of the wife being taller than her husband, if the husband
is of average height;

(c) what is the probability of the wife being taller than the third quartile of all the
wives’ heights, if her husband’s height is at the third quartile of all the husbands’
heights?


(a) The conditional expected height of a man whose wife is 61″ tall is given by
Equation 6.141, with the subscripts switched:

\[ E(X_1 | X_2 = 61) = 68 + 0.25 \cdot 4 \cdot \frac{61 - 64}{3.6} \approx 67.17. \tag{6.151} \]

(b) The conditional variance of $X_2$, if the husband is of average height, is given
by Equation 6.142:

\[ \mathrm{Var}(X_2 | X_1 = 68) = (1 - 0.25^2) \cdot 3.6^2 \approx 12.15. \tag{6.152} \]

Thus, under the condition $X_1 = 68$, $X_2$ is normal with $\mu = 64$ and
$\sigma = \sqrt{12.15} \approx 3.49$. Therefore

\[ P(X_2 > 68 | X_1 = 68) = 1 - \Phi\left( \frac{68 - 64}{3.49} \right) \approx 0.126. \tag{6.153} \]

(c) The $z$-value for the third quartile is, from $P(Z < z) = 0.75$, $z_{0.75} \approx 0.6745$.
Thus, the third quartile of all the wives’ heights is $x_{2,0.75} = \mu_2 + \sigma_2 z_{0.75} \approx 64 +
3.6 \cdot 0.6745 \approx 66.428''$ and the third quartile of all the husbands’ heights is $x_{1,0.75} =
\mu_1 + \sigma_1 z_{0.75} \approx 68 + 4 \cdot 0.6745 \approx 70.698''$. Hence, under the condition $X_1 = 70.698$,
$X_2$ is normal with $\mu \approx 64 + 0.25 \cdot 3.6 \cdot [(70.698 - 68)/4] \approx 64.607$ and
$\sigma = \sqrt{(1 - 0.25^2) \cdot 3.6^2} \approx 3.4857$. Therefore

\[ P(X_2 > 66.428 | X_1 = 70.698) \approx 1 - \Phi\left( \frac{66.428 - 64.607}{3.4857} \right) \approx 1 - \Phi(0.5224) \approx 0.301. \tag{6.154} \]

Note that if we write $X_1$ and $X_2$ in standard units as $Z_1$ and $Z_2$, then the condition
$X_1 = 70.698$ corresponds to $Z_1 = 0.6745$ and under this condition, by
Equation 6.144, $Z_2$ is normal with $\mu = \rho z_1 \approx 0.25 \cdot 0.6745 \approx 0.1686$ and
$\sigma = \sqrt{1 - 0.25^2} \approx 0.96825$. Thus,

\[ P(X_2 > x_{2,0.75} | X_1 = x_{1,0.75}) = P(Z_2 > z_{0.75} | Z_1 = z_{0.75}) \approx 1 - \Phi\left( \frac{0.6745 - 0.1686}{0.96825} \right) \approx 1 - \Phi(0.5224) \approx 0.301, \tag{6.155} \]

as before. As this calculation in standard units shows, the result does not depend on
$\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, but only on $\rho$.
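Part (c) can be recomputed both in standard units and in the original units, which also illustrates that the answer depends only on $\rho$. The sketch below relies on `statistics.NormalDist` from the Python standard library (version 3.8 or later); the function names are our own.

```python
from statistics import NormalDist

std = NormalDist()  # standard normal distribution

def prob_above_quantile(q, rho):
    """P(Z2 > z_q | Z1 = z_q) for a standardized bivariate normal pair with correlation rho."""
    z = std.inv_cdf(q)
    cond_mean = rho * z                   # regression line in standard units (Equation 6.145)
    cond_sd = (1.0 - rho**2) ** 0.5       # Equation 6.142 with sigma_2 = 1
    return 1.0 - std.cdf((z - cond_mean) / cond_sd)

def prob_above_quantile_raw(q, mu1, mu2, sigma1, sigma2, rho):
    """The same probability computed in the original units, as in part (c)."""
    x1 = NormalDist(mu1, sigma1).inv_cdf(q)
    x2 = NormalDist(mu2, sigma2).inv_cdf(q)
    cond = NormalDist(mu2 + rho * sigma2 * (x1 - mu1) / sigma1,
                      sigma2 * (1.0 - rho**2) ** 0.5)
    return 1.0 - cond.cdf(x2)

p_std = prob_above_quantile(0.75, 0.25)
p_raw = prob_above_quantile_raw(0.75, 68, 64, 4, 3.6, 0.25)
print(p_std, p_raw)  # the two values should agree
```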


Example 6.5.3 (Density with a Homogeneous Quadratic Exponent). Let $X_1$ and $X_2$
have a joint density of the form

\[ f(x_1, x_2) = C \exp\left( -\frac{1}{2} \left( x_1^2 - 2x_1x_2 + 4x_2^2 \right) \right), \tag{6.156} \]

where $C$ is an appropriate constant. Show that $(X_1, X_2)$ is a bivariate normal pair
and find its parameters and $C$.

This problem could be solved by integration, but it is much easier to just compare
the exponents in Equations 6.147 and 6.156, which is what we shall do.

First, clearly, $\mu_1 = \mu_2 = 0$, and the equality of the exponents requires that we
solve

\[ \frac{1}{1-\rho^2} \left( \frac{x_1^2}{\sigma_1^2} - 2\rho\, \frac{x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2} \right) = a x_1^2 - 2b x_1x_2 + c x_2^2 \quad \text{for all } x_1, x_2, \tag{6.157} \]

for the unknowns $\sigma_1$, $\sigma_2$ and $\rho$, with $a = b = 1$ and $c = 4$. Hence

\[ a = \frac{1}{(1-\rho^2)\sigma_1^2}, \qquad b = \frac{\rho}{(1-\rho^2)\sigma_1\sigma_2}, \qquad c = \frac{1}{(1-\rho^2)\sigma_2^2}, \tag{6.158} \]

and so

\[ \rho^2 = \frac{b^2}{ac} = \frac{1}{4}, \qquad 1 - \rho^2 = \frac{ac - b^2}{ac} = \frac{3}{4} \tag{6.159} \]

and

\[ \sigma_1^2 = \frac{1}{a(1-\rho^2)} = \frac{c}{ac - b^2} = \frac{4}{3}, \qquad \sigma_2^2 = \frac{1}{c(1-\rho^2)} = \frac{a}{ac - b^2} = \frac{1}{3}. \tag{6.160} \]

Also since $\sigma_1 > 0$, $\sigma_2 > 0$ and $\mathrm{sign}(\rho) = \mathrm{sign}(b)$, we obtain $\sigma_1 = 2/\sqrt{3}$,
$\sigma_2 = 1/\sqrt{3}$, $\rho = 1/2$ and $C = 1/(2\pi\sqrt{1-\rho^2}\,\sigma_1\sigma_2) = \sqrt{3}/(2\pi)$.

Thus, we have found the values of the parameters that make the density given by
Equation 6.156 correspond to a bivariate normal density, thereby also showing that
$(X_1, X_2)$ is a bivariate normal pair.

The method of the previous example yields the following generalization:


Theorem 6.5.3 (Bivariate Normal Density with General Quadratic Exponent).
A pair of random variables $(X_1, X_2)$ is bivariate normal if and only if its density is
of the form

\[ f(x_1, x_2) = C \exp\left( -\frac{1}{2} \left[ a(x_1 - \mu_1)^2 - 2b(x_1 - \mu_1)(x_2 - \mu_2) + c(x_2 - \mu_2)^2 \right] \right), \tag{6.161} \]

for any constants $a$, $b$, $c$ satisfying¹⁰ $a > 0$, $ac - b^2 > 0$, and $C = \sqrt{ac - b^2}/(2\pi)$.
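The recipe of Equations 6.158 to 6.160 translates directly into a short function. This is a sketch under the assumptions of Theorem 6.5.3 ($a > 0$ and $ac - b^2 > 0$); the name `bivariate_from_quadratic` is our own, and the test values are those of Example 6.5.3.

```python
import math

def bivariate_from_quadratic(a, b, c):
    """sigma1, sigma2, rho and C for the exponent a*x1^2 - 2b*x1*x2 + c*x2^2
    (Equations 6.158-6.160 and the constant C of Theorem 6.5.3)."""
    disc = a * c - b * b
    if a <= 0 or disc <= 0:
        raise ValueError("need a > 0 and ac - b^2 > 0")
    sigma1 = math.sqrt(c / disc)
    sigma2 = math.sqrt(a / disc)
    rho = b / math.sqrt(a * c)          # sign(rho) = sign(b)
    C = math.sqrt(disc) / (2 * math.pi)
    return sigma1, sigma2, rho, C

sigma1, sigma2, rho, C = bivariate_from_quadratic(1, 1, 4)
print(sigma1, sigma2, rho, C)  # 2/sqrt(3), 1/sqrt(3), 1/2, sqrt(3)/(2*pi)
```

The two expressions for $C$, namely $\sqrt{ac - b^2}/(2\pi)$ and $1/(2\pi\sqrt{1-\rho^2}\,\sigma_1\sigma_2)$, should agree, which gives a convenient consistency check.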


Example 6.5.4 (Density with an Inhomogeneous Quadratic Exponent). Let $X_1$ and
$X_2$ have a joint density of the form

\[ f(x_1, x_2) = A \exp\left( -\frac{1}{2} \left( x_1^2 - 2x_1x_2 + 4x_2^2 - 4x_1 + 10x_2 \right) \right), \tag{6.162} \]

where $A$ is an appropriate constant. Show that $(X_1, X_2)$ is a bivariate normal pair
and find its parameters and $A$.

First, we want to put $f(x_1, x_2)$ in the form of Equation 6.161. We expand the
terms in Equation 6.161 and compare the result with Equation 6.162. Since variances
and covariances do not depend on the values of $\mu_1$ and $\mu_2$, we can set $\mu_1 = \mu_2 = 0$ in
Equation 6.161. Thus we find that $\sigma_1$, $\sigma_2$ and $\rho$ depend only on the quadratic terms,
and therefore, together with $a = b = 1$ and $c = 4$, are the same as in Example 6.5.3.

To find $\mu_1$ and $\mu_2$, we may compare the first degree terms of the exponents in
Equations 6.161 and 6.162. So, we must have that

\[ (-2a\mu_1 + 2b\mu_2)x_1 = -4x_1 \quad \text{and} \quad (-2c\mu_2 + 2b\mu_1)x_2 = 10x_2, \tag{6.163} \]

that is,

\[ -\mu_1 + \mu_2 = -2 \quad \text{and} \quad \mu_1 - 4\mu_2 = 5. \tag{6.164} \]

Thus, $\mu_1 = 1$ and $\mu_2 = -1$ and

\[ f(x_1, x_2) = C \exp\left( -\frac{1}{2} \left[ (x_1 - 1)^2 - 2(x_1 - 1)(x_2 + 1) + 4(x_2 + 1)^2 \right] \right), \tag{6.165} \]

where $C = \sqrt{3}/(2\pi)$ as in Example 6.5.3, and $A$ is $C$ times the exponential of the
constant term in the quadratic expression above, i.e.,

\[ A = \frac{\sqrt{3}}{2\pi} \exp\left( -\frac{1}{2} \left[ (-1)^2 - 2(-1)(1) + 4(1)^2 \right] \right) = \frac{\sqrt{3}}{2\pi}\, e^{-7/2}. \]

Thus, by putting $f(x_1, x_2)$ in the form of Equation 6.161, we have shown that
$(X_1, X_2)$ is a bivariate normal pair and we have found all parameters.
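The comparison of first-degree terms amounts to solving two linear equations, which can be sketched as follows. The function name and the arguments `d1`, `d2` (the coefficients of $x_1$ and $x_2$ inside the parentheses of the exponent) are our own notation, and integer coefficients are assumed so that exact fractions can be used.

```python
import math
from fractions import Fraction

def means_from_linear_terms(a, b, c, d1, d2):
    """Solve Equation 6.163 for (mu1, mu2); d1 and d2 are the coefficients
    of x1 and x2 in the quadratic expression of the exponent (integers assumed)."""
    # System: -2a*mu1 + 2b*mu2 = d1 and 2b*mu1 - 2c*mu2 = d2, solved by Cramer's rule.
    det = (-2 * a) * (-2 * c) - (2 * b) * (2 * b)   # equals 4(ac - b^2)
    mu1 = Fraction(d1 * (-2 * c) - (2 * b) * d2, det)
    mu2 = Fraction((-2 * a) * d2 - (2 * b) * d1, det)
    return mu1, mu2

a, b, c = 1, 1, 4          # quadratic coefficients of Equation 6.162
mu1, mu2 = means_from_linear_terms(a, b, c, -4, 10)

# Constant term of the expanded quadratic expression in Equation 6.165.
constant = a * mu1**2 - 2 * b * mu1 * mu2 + c * mu2**2
A = math.sqrt(a * c - b * b) / (2 * math.pi) * math.exp(-float(constant) / 2)
print(mu1, mu2, constant, A)
```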


10 A quadratic form (that is, a polynomial with quadratic terms only) whose coefficients
satisfy these conditions is called positive definite, because its values are then positive for any
choice of $x_1$ and $x_2$ that are not both zero. (See any linear algebra text.)

We have another straightforward consequence of Definition 6.5.1:


Theorem 6.5.4 (Linear Combinations of Bivariate Normals are Bivariate Normal).
If $X_1$ and $X_2$ form a bivariate normal pair and, for any constants $c_{11}, c_{12}, c_{21},
c_{22}, d_1$ and $d_2$ satisfying $c_{11}^2 + c_{12}^2 \ne 0$, $c_{21}^2 + c_{22}^2 \ne 0$ and $c_{11}c_{22} - c_{12}c_{21} \ne 0$,

\[ T_1 = c_{11} X_1 + c_{12} X_2 + d_1 \]

and

\[ T_2 = c_{21} X_1 + c_{22} X_2 + d_2, \tag{6.166} \]

then $T_1$ and $T_2$ are a bivariate normal pair, too.


Proof. Substituting $X_1$ and $X_2$ from Equations 6.133 into the definition of $T_1$ and
$T_2$, we get the latter as linear functions of $Z_1$ and $Z_2$, which shows that they are a
bivariate normal pair, too. □


Corollary 6.5.1 (Existence of Independent Linear Combinations). If $X_1$ and $X_2$
form a bivariate normal pair, then there exist constants $c_{11}$, $c_{12}$, $c_{21}$, $c_{22}$, satisfying
$c_{11}^2 + c_{12}^2 \ne 0$, $c_{21}^2 + c_{22}^2 \ne 0$ and $c_{11}c_{22} - c_{12}c_{21} \ne 0$, such that

$$T_1 = c_{11}X_1 + c_{12}X_2$$

and

$$T_2 = c_{21}X_1 + c_{22}X_2 \tag{6.167}$$

are independent normal random variables.


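Proof. Suppose $X_1$ and $X_2$ are given in terms of their parameters $\sigma_1 > 0$, $\sigma_2 > 0$,
$\rho$, $\mu_1$, $\mu_2$. If $\rho = 0$, then $X_1$ and $X_2$ are themselves independent, and so assume that
$\rho \ne 0$. Clearly,

$$\begin{aligned}
\mathrm{Cov}(T_1, T_2) &= c_{11}c_{21}\mathrm{Var}(X_1) + (c_{11}c_{22} + c_{12}c_{21})\mathrm{Cov}(X_1, X_2) + c_{12}c_{22}\mathrm{Var}(X_2) \\
&= c_{11}c_{21}\sigma_1^2 + (c_{11}c_{22} + c_{12}c_{21})\sigma_1\sigma_2\rho + c_{12}c_{22}\sigma_2^2.
\end{aligned} \tag{6.168}$$

If we set, for instance, $c_{21} = 0$, $c_{11} = c_{22} = 1$ and $c_{12} = -\rho\sigma_1/\sigma_2$, then
$\mathrm{Cov}(T_1, T_2) = 0$ and the inequalities required of the $c_{ij}$'s are also satisfied. By
Theorem 6.5.4, $T_1$ and $T_2$ then form a bivariate normal pair with zero covariance, and
so $T_1$ and $T_2$ are independent. There exist infinitely many other solutions as well.
One, a rotation, given in Exercise 6.5.5, is especially interesting.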
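The choice of constants made in the proof can be checked numerically. The following is a minimal sketch of Equation 6.168; the function name and the parameter values $\sigma_1 = 3$, $\sigma_2 = 2$, $\rho = 0.8$ are illustrative assumptions, not taken from the text.

```python
def cov_linear_combinations(c11, c12, c21, c22, sigma1, sigma2, rho):
    """Cov(T1, T2) from Equation 6.168 for
    T1 = c11*X1 + c12*X2 and T2 = c21*X1 + c22*X2."""
    return (c11 * c21 * sigma1**2
            + (c11 * c22 + c12 * c21) * sigma1 * sigma2 * rho
            + c12 * c22 * sigma2**2)

# Illustrative parameter values (assumed for this sketch)
sigma1, sigma2, rho = 3.0, 2.0, 0.8

# The choice made in the proof: c21 = 0, c11 = c22 = 1, c12 = -rho*sigma1/sigma2
c11, c21, c22 = 1.0, 0.0, 1.0
c12 = -rho * sigma1 / sigma2

cov = cov_linear_combinations(c11, c12, c21, c22, sigma1, sigma2, rho)
det = c11 * c22 - c12 * c21   # must be nonzero for the corollary

print(cov, det)
```

Up to floating-point rounding the covariance is zero, while the determinant $c_{11}c_{22} - c_{12}c_{21}$ equals 1, so the required inequalities hold.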


Definition 6.5.1 can easily be generalized to more than two variables: 


Definition 6.5.2 (Multivariate Normal Random Variables). For any integers $m, n \ge 2$,
let $Z_1, Z_2, \ldots, Z_n$ be independent standard normal random variables and
let $a_{ij}$ and $b_i$, for all $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$, be any constants satisfying
$\sum_{j=1}^{n} a_{ij}^2 \ne 0$ for all $i$. Then the random variables

$$X_i = \sum_{j=1}^{n} a_{ij}Z_j + b_i \quad\text{for } i = 1, 2, \ldots, m \tag{6.169}$$

are said to form a multivariate normal m-tuple.

Note that the joint distribution of the $X_i$ may be less than $n$-dimensional. (It is
$n$-dimensional if and only if the matrix $(a_{ij})$ has rank $n$.)

Next, we state some theorems about multivariate normal random variables with- 
out proof. 


Theorem 6.5.5 (Two Linearly Independent Linear Combinations of Independent
Normals Are Bivariate Normal). Any two of the $X_i$ defined above, say $X_i$
and $X_k$, form a bivariate normal pair, provided that neither $X_i - b_i$ nor $X_k - b_k$ is
a scalar multiple of the other.


Theorem 6.5.6 (For Multivariate Normal Random Variables Zero Covariances
Imply Independence). If $X_1, X_2, \ldots, X_m$ form a multivariate normal m-tuple and
$\mathrm{Cov}(X_i, X_k) = 0$ for all $i \ne k$, then $X_1, X_2, \ldots, X_m$ are totally independent.


Theorem 6.5.7 (Density Function of Multivariate Normal Random Variables).
$X_1, X_2, \ldots, X_m$ form a multivariate normal m-tuple if and only if their joint density
is of the form

$$f(x_1, x_2, \ldots, x_m) = C\exp\left[-\frac{1}{2}\sum_{i=1}^{m}\sum_{k=1}^{m} c_{ik}(x_i - \mu_i)(x_k - \mu_k)\right], \tag{6.170}$$

where the $\mu_i$ and $\mu_k$ are any constants and the $c_{ik}$ are such that the quadratic form
$\sum_{i=1}^{m}\sum_{k=1}^{m} c_{ik}(x_i - \mu_i)(x_k - \mu_k)$ is positive semidefinite¹¹ and $C$ is a normalizing
constant.


The last theorem could be sharpened by giving an explicit formula for $C$ and
relating the $c_{ik}$ to the covariances (the $\mu_i$ are the expected values), but even to state
these relations would require concepts from linear algebra and the proof would require
multivariable calculus. We leave such matters to more advanced books.


Exercises 


Exercise 6.5.1. Show that $Y_1$ and $Y_2$ defined by Equations 6.138 are standard normal
and $\mathrm{Cov}(Y_1, Y_2) = 0$.


Exercise 6.5.2. Let $(X_1, X_2)$ be a bivariate normal pair with parameters $\mu_1 = 2$,
$\mu_2 = -1$, $\sigma_1 = 3$, $\sigma_2 = 2$ and $\rho = 0.8$. Find

1. $f(x_1, x_2)$,

2. $E(X_2 | X_1 = x_1)$ and $E(X_1 | X_2 = x_2)$,

3. $\mathrm{Var}(X_2 | X_1 = x_1)$ and $\mathrm{Var}(X_1 | X_2 = x_2)$,

4. $f_{X_2|X_1}(x_2 | x_1)$ and $f_{X_1|X_2}(x_1 | x_2)$,

5. $f_{X_1}(x_1)$ and $f_{X_2}(x_2)$.


11 A quadratic form is positive semidefinite if its value is $\ge 0$ for all arguments. For instance,
in two dimensions $(x + y)^2$ is positive semidefinite.
Exercise 6.5.3. Let $X_1$ and $X_2$ have a joint density of the form

$$f(x_1, x_2) = A\exp\left[-\frac{1}{2}\left(x_1^2 + x_1x_2 + 2x_2^2 - 2x_1 + 6x_2\right)\right], \tag{6.171}$$

where $A$ is an appropriate constant. Show that $(X_1, X_2)$ is a bivariate normal pair
and find its parameters and $A$.


Exercise 6.5.4. Show that if $(X, Y)$ has the joint density

$$f(x, y) = \frac{1}{2\pi}\left[\left(\sqrt{2}\,e^{-x^2/2} - e^{-x^2}\right)e^{-y^2} + \left(\sqrt{2}\,e^{-y^2/2} - e^{-y^2}\right)e^{-x^2}\right], \tag{6.172}$$

which is not bivariate normal, then $X$ and $Y$ have standard normal marginal densities
and their covariance is zero, but they are not independent.


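Exercise 6.5.5. Let $(X_1, X_2)$ be a bivariate normal pair with $\mathrm{Cov}(X_1, X_2) \ne 0$.
Show that the rotation

$$\begin{aligned}
T_1 &= X_1\cos\theta - X_2\sin\theta \\
T_2 &= X_1\sin\theta + X_2\cos\theta
\end{aligned} \tag{6.173}$$

by the angle $\theta$ results in independent normal $T_1$ and $T_2$ if and only if

$$\cot 2\theta = \frac{\mathrm{Var}(X_2) - \mathrm{Var}(X_1)}{2\,\mathrm{Cov}(X_1, X_2)}. \tag{6.174}$$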


Exercise 6.5.6. What is the probability that the average score of a randomly selected 
student in the two exams of Example 6.5.1 will be over 80? 


Exercise 6.5.7. What is the 90th percentile score in the second exam of Example 
6.5.1 for those students who scored 80 on the first exam? 


Exercise 6.5.8. The heights and weights of a large number of men were found to
have a bivariate normal distribution with $\rho = 0.7$. If a randomly selected man’s
height from this population is at the third quartile, then what is the percentile rank of
the expected value of his weight under this condition?


Exercise 6.5.9. Prove that a pair $(X_1, X_2)$ of random variables is bivariate normal if
and only if $Y = aX_1 + bX_2$ is normal for every choice of constants $a$ and $b$, not both
zero.


Exercise 6.5.10. What is the probability of the wife being taller than her husband for
a randomly selected couple from the population described in Example 6.5.2 (without
any restriction on the husband’s height)?


Exercise 6.5.11. Let $(X_1, X_2)$ be a bivariate normal pair with parameters $\mu_1 = 2$,
$\mu_2 = -1$, $\sigma_1 = 3$, $\sigma_2 = 2$ and $\rho = 0.8$. Find the parameters and the joint density of
$U_1 = X_1 + 2X_2$ and $U_2 = X_1 - 2X_2 + 1$.
The Elements of Mathematical Statistics


7.1 Estimation 


In studying probability theory, we always assumed that we knew some probabilities 
and we computed other probabilities or related quantities from those. On the other 
hand, in mathematical statistics, we use observed data to compute probabilities or 
related quantities, or to make decisions or predictions. 

The problems of mathematical statistics are classified as parametric or nonparametric,
depending on how much we know or assume about the distribution of the
data. In parametric problems, we assume that the distribution belongs to a given
family; for instance, that the data are observations of values of a normal random
variable, and we want to determine a parameter or parameters, such as $\mu$ or $\sigma$. In
nonparametric problems we make no assumption about the distribution and want to
determine either single quantities like $E(X)$ or the whole distribution, i.e., $F(x)$ or
$f(x)$, or use the data for decisions or predictions.

We begin with some essential terminology. First, we restate and somewhat ex- 
pand Definition 6.2.3: 


Definition 7.1.1 (Random Sample, Statistic and Sample Mean). $n$ independent
and identically distributed (abbreviated: i.i.d.) random variables $X_1, \ldots, X_n$ are said
to form a random sample of size $n$ from their common distribution. Any function
$g(X_1, \ldots, X_n)$ of the sample variables is called a statistic and the particular statistic
$\bar{X}_n = (1/n)\sum X_i$ is called the sample mean. The probability distribution of a
statistic is sometimes called a sampling distribution.


Suppose that the common p.f. or p.d.f. of the $X_i$ is $f(x)$ or, if we want to indicate
the dependence on a parameter, $f(x; \theta)$. We shall denote the joint p.f. or
p.d.f. of $(X_1, \ldots, X_n)$ by $f_n(\mathbf{x})$ or $f_n(\mathbf{x}; \theta)$, where we use the vector abbreviation
$\mathbf{x} = (x_1, \ldots, x_n)$ for the possible values of $\mathbf{X} = (X_1, \ldots, X_n)$. We shall use the
general notation $\theta$ for vector-valued parameters too, for example, for $(\mu, \sigma)$ in the
case of normal $X_i$.

Definition 7.1.2 (Estimator and Estimate). Given a random sample $\mathbf{X}$ whose distribution
depends on an unknown parameter $\theta$, a statistic $g(\mathbf{X})$ is called an estimator
of $\theta$ if, for any observed value $\mathbf{x}$ of $\mathbf{X}$, $g(\mathbf{x})$ is considered to be an estimate of $\theta$.
The estimator $g(\mathbf{X})$ is a random variable and, to emphasize its connection to $\theta$, we
sometimes denote it by $\hat{\Theta}$. The observed value $g(\mathbf{x})$ is a number (or a vector), which
we also denote by $\hat{\theta}$.


The most commonly used method for obtaining estimators and estimates is the 
following. 


Definition 7.1.3 (Method of Maximum Likelihood). Consider a random sample $\mathbf{X}$
whose distribution depends on an unknown parameter $\theta$. For any fixed $\mathbf{x}$, the function
$f_n(\mathbf{x}; \theta)$, regarded as a function $L(\theta)$ of $\theta$, is called the likelihood function of $\theta$. A
value $\hat{\theta}$ of $\theta$ that maximizes $L(\theta)$ is called a maximum likelihood estimate of $\theta$. In
many important applications, $\hat{\theta}$ exists, is unique and is a function of $\mathbf{x}$. For $\hat{\theta} = g(\mathbf{x})$,
we call the random variable $\hat{\Theta} = g(\mathbf{X})$ the maximum likelihood estimator of $\theta$. We
abbreviate both the maximum likelihood estimate and maximum likelihood estimator
as MLE.


The reasoning behind this method is that among the various possible values of $\theta$
the most likely value should be one that makes the probability (or probability density)
of the observed $\mathbf{x}$ as high as possible. For example, consider a sample of just
one observation $x$ from a normal distribution with unknown mean $\mu$ (the general
parameter $\theta$ is now $\mu$) and known $\sigma$. If we observe $x = 2$, then among the p.d.f.
curves shown in Figure 7.1, the right-most one, with $\mu = 2$, is the most likely to
have generated this $x$, and so we choose $\hat{\mu} = 2$, because that choice gives the highest
probability to $X$ being near the observed $x = 2$.

Next, we present several examples using the method. 


Example 7.1.1 (Estimating the Probability of an Event). Consider any event $A$ in
any probability space and let $p$ denote its unknown probability. Let $X$ be a Bernoulli
r.v. with parameter $\theta = p$ so that $X = 1$ if $A$ occurs and 0 otherwise. (Such an
$X$ is called the indicator function of $A$.) To estimate $p$, we perform the underlying
experiment $n$ times and observe the corresponding i.i.d. Bernoulli random variables
$X_1, \ldots, X_n$. The p.f. of $X_i$ can be written as

$$f(x_i; p) = p^{x_i}(1 - p)^{1 - x_i} \quad\text{for } x_i = 0, 1. \tag{7.1}$$

Hence the likelihood function of $p$ is

$$L(p) = f_n(\mathbf{x}; p) = \prod_{i=1}^{n} p^{x_i}(1 - p)^{1 - x_i} = p^{\sum x_i}(1 - p)^{n - \sum x_i}. \tag{7.2}$$

To find the maximum of $L(p)$, we may differentiate $\ln(L(p))$:

$$\ln(L(p)) = \left(\sum x_i\right)\ln p + \left(n - \sum x_i\right)\ln(1 - p), \tag{7.3}$$

and

$$\frac{d}{dp}\ln(L(p)) = \left(\sum x_i\right)\frac{1}{p} - \left(n - \sum x_i\right)\frac{1}{1 - p}. \tag{7.4}$$

Setting this expression equal to 0, dividing by $n$ and writing $\bar{x}_n = (1/n)\sum x_i$, we
get

$$\bar{x}_n\,\frac{1}{p} - (1 - \bar{x}_n)\frac{1}{1 - p} = 0. \tag{7.5}$$

This equation gives the critical value $p = \bar{x}_n$. The second derivative shows that $L$
has a maximum there, as required. Thus, our maximum likelihood estimate of $p$ is
$\hat{p} = \bar{x}_n$ and the corresponding maximum likelihood estimator is $\hat{P} = \bar{X}_n$.

This estimator has two very desirable properties:


1. It is unbiased, that is, its expected value is the true (though unknown) value of
the parameter: $E(\bar{X}_n) = p$, by Equation 5.85.

2. It is consistent, meaning that it converges to $p$ in probability as $n \to \infty$, that is,
$\lim_{n\to\infty} P(|\bar{X}_n - p| < \varepsilon) = 1$ for every $\varepsilon > 0$, by the law of large numbers, Theorem 5.2.7.
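The maximization in Example 7.1.1 can also be illustrated numerically. The sketch below evaluates the log-likelihood of Equation 7.3 on a grid of $p$ values and picks the largest; the data (6 successes in 10 trials), the grid and the function name are assumptions made only for this illustration.

```python
import math

def log_likelihood(p, xs):
    """ln L(p) from Equation 7.3 for 0/1 observations xs and 0 < p < 1."""
    s = sum(xs)
    n = len(xs)
    return s * math.log(p) + (n - s) * math.log(1 - p)

xs = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]          # hypothetical data: 6 successes in 10 trials
sample_mean = sum(xs) / len(xs)

grid = [k / 1000 for k in range(1, 1000)]    # p = 0.001, 0.002, ..., 0.999
p_best = max(grid, key=lambda p: log_likelihood(p, xs))

print(sample_mean, p_best)
```

The grid point with the largest log-likelihood coincides with the sample mean, in agreement with $\hat{p} = \bar{x}_n$.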


Example 7.1.2 (Estimating the Mean of a Normal Distribution with Known Variance).
For a random sample from an $N(\mu, \sigma^2)$ distribution with known $\sigma$ and unknown
$\mu$, the likelihood function of $\mu$ is

$$L(\mu) = f_n(\mathbf{x}; \mu) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x_i - \mu)^2/2\sigma^2} = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} e^{-\sum(x_i - \mu)^2/2\sigma^2}. \tag{7.6}$$

Clearly, this function takes on its maximum when

$$g(\mu) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 \tag{7.7}$$

is minimum. Differentiating and setting $g'(\mu)$ to zero, we get

$$g'(\mu) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} 2(x_i - \mu) = 0, \tag{7.8}$$

from which

$$\sum_{i=1}^{n} x_i - n\mu = 0. \tag{7.9}$$

Hence $\mu = (1/n)\sum x_i = \bar{x}_n$ is the critical value of $\mu$. Since $g''(\mu) = n/\sigma^2 > 0$,
the function $g$ has a minimum and the function $L$, a maximum at $\mu = \bar{x}_n$.

Thus, again, the maximum likelihood estimate is $\bar{x}_n$ and the maximum likelihood
estimator is $\hat{M} = \bar{X}_n$, with the same two properties that were mentioned at the end
of the preceding example.


Example 7.1.3 (Estimating the Mean and Variance of a Normal Distribution). For a
random sample from an $N(\mu, \sigma^2)$ distribution with unknown $\mu$ and $\sigma$, the likelihood
function is a function of two variables, or, in other words, the parameter $\theta$ may be
regarded as the two-dimensional vector $(\mu, \sigma^2)$ or as $(\mu, \sigma)$. Thus

$$L(\mu, \sigma) = f_n(\mathbf{x}; \mu, \sigma) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x_i - \mu)^2/2\sigma^2} = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} e^{-\sum(x_i - \mu)^2/2\sigma^2}. \tag{7.10}$$

Now, we need to set the two partial derivatives of $L$ equal to zero, and solve
the resulting two equations simultaneously. The solution of $\partial L/\partial\mu = 0$ turns out
to be independent of $\sigma$ and exactly the same as in the previous example. So, we get
$\hat{\mu} = \bar{x}_n$ again.

To solve $\partial L/\partial\sigma = 0$, we use logarithmic differentiation:

$$\ln L(\mu, \sigma) = -n\ln\sqrt{2\pi} - n\ln\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 \tag{7.11}$$

and

$$\frac{\partial}{\partial\sigma}\ln L(\mu, \sigma) = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i - \mu)^2 = 0, \tag{7.12}$$

which yields

$$\sigma = \left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2\right]^{1/2}. \tag{7.13}$$

Using the second derivative test for functions of two variables, we could show that $L$
has a maximum at these values of $\mu$ and $\sigma$.
Hence the MLE of the standard deviation is

$$\hat{\sigma} = \left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2\right)^{1/2}. \tag{7.14}$$

We leave it as an exercise to show that the MLE $\widehat{\sigma^2}$ of the variance equals $\hat{\sigma}^2$.
Next, we are going to show that the corresponding estimator

$$\hat{\Sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 \tag{7.15}$$

of the variance is biased. ($\hat{\Sigma}^2$ is called the sample variance and $\hat{\Sigma}$ the sample standard
deviation.)


Let us first reformulate the sum in the above expression:

$$\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 = \sum_{i=1}^{n}X_i^2 - 2\bar{X}_n\sum_{i=1}^{n}X_i + \sum_{i=1}^{n}\bar{X}_n^2. \tag{7.16}$$

Substituting $\sum_{i=1}^{n} X_i = n\bar{X}_n$ in the middle term on the right-hand side, we get

$$\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 = \sum_{i=1}^{n}X_i^2 - n\bar{X}_n^2. \tag{7.17}$$

Hence

$$E\left(\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2\right) = \frac{1}{n}\sum_{i=1}^{n}E\left(X_i^2\right) - E\left(\bar{X}_n^2\right) \tag{7.18}$$

$$= \left(\sigma^2 + \mu^2\right) - \left(\frac{\sigma^2}{n} + \mu^2\right) = \frac{n - 1}{n}\,\sigma^2. \tag{7.19}$$

As the above formula shows, we can define an unbiased estimator of the variance
by

$$\hat{V} = \frac{n}{n - 1}\hat{\Sigma}^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2, \tag{7.20}$$

and this is the estimator used by most statisticians, together with the corresponding
estimate $\hat{v}$ instead of $\hat{\sigma}^2$. In fact, many books call this $\hat{V}$ the sample variance, and
most statistical calculators have keys for both $\hat{v}$ and $\hat{\sigma}^2$. For large $n$, the two estimates
differ by very little, and the choice is really arbitrary anyway. In principle, however,
$\hat{\sigma}^2$ seems more natural and, although $\hat{V}$ is an unbiased estimator of the variance, $\sqrt{\hat{V}}$
is not an unbiased estimator of the standard deviation.¹
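The bias factor $(n - 1)/n$ of Equation 7.19 does not depend on normality, so it can be verified exactly on a small discrete population. The sketch below is an illustration under assumed conditions: a fair die as the population, samples of size $n = 3$, and exact rational arithmetic; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def mean(values):
    return sum(values) / len(values)

def biased_variance(sample):
    """(1/n) * sum (x_i - xbar)^2, the form used in Equation 7.15."""
    m = mean(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

# Hypothetical population: the six equally likely faces of a fair die
faces = [Fraction(k) for k in range(1, 7)]
sigma2 = biased_variance(faces)               # population variance

n = 3
samples = list(product(faces, repeat=n))      # all 6**n equally likely samples
expected_biased = mean([biased_variance(s) for s in samples])
expected_unbiased = expected_biased * Fraction(n, n - 1)   # factor from Equation 7.20

print(sigma2, expected_biased, expected_unbiased)
```

Averaging over all equally likely samples gives exactly $\frac{n-1}{n}\sigma^2$ for the estimator of Equation 7.15 and exactly $\sigma^2$ after multiplying by $n/(n - 1)$.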


Example 7.1.4 (Estimating the Upper Bound of a Uniform Distribution). Let $X$ be
uniform on the interval $[0, \theta]$, with the value of the parameter $\theta$ unknown. Then the
p.d.f. of $X$ is given by

$$f(x; \theta) = \begin{cases} 1/\theta & \text{if } 0 \le x \le \theta \\ 0 & \text{otherwise,} \end{cases} \tag{7.21}$$

and so the likelihood function is given by

$$L(\theta) = f_n(\mathbf{x}; \theta) = \begin{cases} 1/\theta^n & \text{if } 0 \le x_i \le \theta \text{ for } i = 1, \ldots, n \\ 0 & \text{otherwise.} \end{cases} \tag{7.22}$$

Since $1/\theta^n$ is a decreasing function of $\theta$, its maximum occurs at the smallest value
of $\theta$ that the inequalities $x_i \le \theta$ allow. (Recall from calculus that the maximum of
a continuous function on a closed interval may occur at an endpoint of the interval,
rather than at a critical point.) Thus the maximum likelihood estimate of $\theta$ must be the largest
observed value $x_i$, that is,

$$\hat{\theta} = \max\{x_1, \ldots, x_n\}. \tag{7.23}$$

Note, however, the curious fact that if $X$ were defined to be uniform on the open
interval $(0, \theta)$ rather than on the closed interval, then the MLE would not exist, because
then we would have to maximize $L(\theta)$ subject to the conditions $0 < x_i < \theta$
and so $\max\{x_1, \ldots, x_n\}$ would not be a possible value for $\theta$.
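The shape of the likelihood in Equation 7.22 can be seen by evaluating it at a few values of $\theta$. This is a minimal sketch; the four observations and the function name are hypothetical.

```python
def likelihood(theta, xs):
    """L(theta) from Equation 7.22 for a sample xs from the uniform
    distribution on the closed interval [0, theta]."""
    if theta > 0 and all(0 <= x <= theta for x in xs):
        return theta ** (-len(xs))
    return 0.0

xs = [0.8, 2.9, 1.7, 2.2]     # hypothetical observations
theta_hat = max(xs)           # Equation 7.23

# L is zero below max(xs), takes its largest value there, then decreases
for theta in (2.5, theta_hat, 3.0, 4.0):
    print(theta, likelihood(theta, xs))
```

Any $\theta$ below the largest observation has likelihood zero, and beyond it the likelihood decreases, so the maximum sits at the endpoint $\max\{x_1, \ldots, x_n\}$.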


For all its many successes and popularity, the method of maximum likelihood 
does not always work. In some cases the maximum does not exist or is not unique. 

Often another method, the method of moments, is used to find estimators. This
method consists of expressing a parameter as a function of the moments of the r.v.
and using the same function of the sample moments as an estimator of the parameter.


Example 7.1.5 (Estimating the Parameter of an Exponential Distribution). Consider
an exponential r.v. $X$ with parameter $\lambda$. Then, by Example 5.1.4, $\lambda = 1/E(X)$.
Hence, according to the method of moments, we estimate $\lambda$ by $\hat{\Lambda} = 1/\bar{X}_n$. On the
other hand, by Equation 5.77, $\lambda = 1/SD(X)$ as well, and so we could estimate $\lambda$ by
$1/\hat{\Sigma}$ also.


Example 7.1.6 (Estimating the Parameter of a Poisson Distribution). Consider a
Poisson r.v. $X$ with parameter $\lambda$. Then, by Theorem 6.1.2, $\lambda = E(X) = \mathrm{Var}(X)$.
Thus, the method of moments suggests the estimator $\hat{\Lambda} = \bar{X}_n$, or $\hat{\Lambda} = \hat{\Sigma}^2$.
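The competing moment estimates of Examples 7.1.5 and 7.1.6 can be computed side by side. The following sketch uses Python's standard `statistics` module; the data and function names are hypothetical, and the population forms `pstdev` and `pvariance` are used because they divide by $n$, as Equations 7.14 and 7.15 do.

```python
import statistics

def exponential_moment_estimates(xs):
    """Two method-of-moments estimates of lambda (Example 7.1.5):
    1/mean and 1/SD."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)          # divides by n, as in Equation 7.14
    return 1 / mean, 1 / sd

def poisson_moment_estimates(xs):
    """Two method-of-moments estimates of lambda (Example 7.1.6):
    the sample mean and the sample variance."""
    return statistics.fmean(xs), statistics.pvariance(xs)

waiting_times = [0.5, 1.2, 0.3, 2.0, 1.0]    # hypothetical exponential data
counts = [2, 0, 3, 1, 4]                      # hypothetical Poisson data

print(exponential_moment_estimates(waiting_times))
print(poisson_moment_estimates(counts))
```

In general the two estimates in each pair differ, which illustrates that the method of moments need not give a unique estimator.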


1 The requirement that an estimator be unbiased can lead to absurd results. See M. Hardy,
An illuminating counterexample, Am. Math. Monthly 110 (2003) 234-238.
Another popular method for obtaining estimators is based on Bayes’ theorem,
but we shall not discuss it here.

On the other hand, even the best estimator can be off the mark. For instance, if 
we toss a fair coin, say, ten times, we may easily get six heads, and so by Exam- 
ple 7.1.1 we would estimate p as 0.6. Consequently, we want to know how much 
confidence we can have in such an estimate. This question is usually answered by 
constructing intervals around the estimate so that these intervals cover the true value 
of the parameter with a given high probability. In other words, we construct interval 
estimates instead of point estimates. 


Example 7.1.7 (Interval Estimates of the Mean of a Normal Distribution with Known
Variance). Let $X_i$ be i.i.d. $N(\mu, \sigma^2)$ random variables for $i = 1, \ldots, n$. Then, by
Corollary 6.2.3, the sample mean $\bar{X}_n$ is $N(\mu, \sigma^2/n)$, and so

$$P\left(\left|\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}\right| < c\right) = 2\Phi(c) - 1 \tag{7.24}$$

for any $c > 0$, or, equivalently,

$$P\left(\bar{X}_n - c\frac{\sigma}{\sqrt{n}} < \mu < \bar{X}_n + c\frac{\sigma}{\sqrt{n}}\right) = 2\Phi(c) - 1. \tag{7.25}$$

If we assume that $\sigma$ is known and $\mu$ is unknown, then Equation 7.25 can be interpreted
as saying that the random interval $(\bar{X}_n - c(\sigma/\sqrt{n}), \bar{X}_n + c(\sigma/\sqrt{n}))$ contains
the unknown, but fixed, parameter $\mu$ with probability $2\Phi(c) - 1$. We must emphasize
that this statement is different from our usual probability statements in which we are
concerned with a random variable falling in a fixed interval. Here the $\mu$ is fixed and
the endpoints of the interval are random variables, because $\bar{X}_n$ is a statistic computed
from a random sample.

Now, if we observe a value $\bar{x}_n$ of $\bar{X}_n$, then the fixed, and no longer random,
interval

$$\left(\bar{x}_n - c\frac{\sigma}{\sqrt{n}},\ \bar{x}_n + c\frac{\sigma}{\sqrt{n}}\right) \tag{7.26}$$

is called a confidence interval for $\mu$ with confidence coefficient or level $\gamma = 2\Phi(c) - 1$,
or a $100\gamma$ percent confidence interval. We cannot say that $\mu$ falls in
this interval with probability $\gamma$, because neither $\mu$ nor the interval is random; this
is why we use the word “confidence” rather than “probability.” The corresponding
probability statement, Equation 7.25, implies that if we observe many such confidence
intervals from different samples, that is, with different observed values for $\bar{X}_n$,
then approximately $100\gamma$ percent of them will contain $\mu$. Whether a single such interval
will contain $\mu$ or not, we usually cannot say. What we can always say is that,
by its definition, our interval is a member of a large set of similar potential intervals,
$100\gamma$ percent of which do contain $\mu$.

It was natural for us to start our discussion of confidence intervals with an arbitrary
value for $c$, but, in applications, it is more common to start with given confidence
coefficients $\gamma$. Then $c$ can be computed as

$$c = \Phi^{-1}\left(\frac{1 + \gamma}{2}\right). \tag{7.27}$$

Thus, for instance, if we want a 95% confidence interval for $\mu$, then $\gamma = 0.95$
yields $c = \Phi^{-1}(1.95/2) = \Phi^{-1}(0.975) \approx 1.96$, that is, $c$ will be the 97.5th percentile
of the standard normal distribution, which is approximately 1.96. Hence, for
an observed sample mean $\bar{x}_n$, the interval $(\bar{x}_n - 1.96(\sigma/\sqrt{n}), \bar{x}_n + 1.96(\sigma/\sqrt{n}))$ is a
95% confidence interval for $\mu$.
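Equations 7.26 and 7.27 translate directly into a short computation. The sketch below uses `NormalDist` from Python's standard `statistics` module for $\Phi^{-1}$; the function name and the sample values ($\bar{x}_n = 10$, $\sigma = 2$, $n = 25$) are assumptions chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def normal_mean_confidence_interval(xbar, sigma, n, gamma):
    """Interval (7.26) with c taken from Equation 7.27; sigma is assumed known."""
    c = NormalDist().inv_cdf((1 + gamma) / 2)     # c = Phi^{-1}((1 + gamma)/2)
    half_width = c * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width, c

# Hypothetical numbers: observed mean 10, known sigma 2, sample size 25
low, high, c = normal_mean_confidence_interval(10.0, 2.0, 25, 0.95)
print(round(c, 2), round(low, 3), round(high, 3))
```

For $\gamma = 0.95$ the computed $c$ agrees with the value 1.96 quoted above to two decimals, and the interval is symmetric about the observed mean.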


We generalize the concepts introduced in the above example as follows: 


Definition 7.1.4 (Confidence Intervals). Consider a random sample $\mathbf{X}$ whose distribution
depends on an unknown parameter $\theta$ and two statistics $A = g_1(\mathbf{X})$ and
$B = g_2(\mathbf{X})$ with $A < B$. If $a$ and $b$ are any observed values of $A$ and $B$ and
$P(A < \theta < B) = \gamma$, then $(a, b)$ is called a $100\gamma$ percent confidence interval for
$\theta$ and $\gamma$ the confidence coefficient or level of the interval $(a, b)$.² If $A = -\infty$ or
$B = \infty$, then $(-\infty, b)$ and $(a, \infty)$ are called one-sided confidence intervals.


The construction in Example 7.1.7 of confidence intervals for the mean of a normal
distribution can be used for the mean of other distributions, or with unknown $\sigma$,
in the case of large samples, when the CLT is applicable. In the case $\sigma$ is unknown,
we just use $\hat{\sigma}$ from Equation 7.14 in Equation 7.25.


Example 7.1.8 (Confidence Intervals for the Probability of an Event). As in Example
7.1.1, consider any event $A$ in any probability space and let $p$ denote its unknown
probability. Let $X$ be a Bernoulli r.v. with parameter $p$ so that $X = 1$ if $A$ occurs
and 0 otherwise. To estimate $p$, we perform the underlying experiment $n$ times and
observe the corresponding i.i.d. Bernoulli random variables $X_1, \ldots, X_n$. As in Example
7.1.1, let $\hat{P} = \bar{X}_n$ and $\hat{p} = \bar{x}_n$. We use the sample variance (Equation 7.15) as
the estimator of the variance of $X$. In the present case, $\sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i$ because
each $X_i$ is 0 or 1, and so

$$\hat{\Sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}X_i^2 - \bar{X}_n^2 = \bar{X}_n - \bar{X}_n^2 = \hat{P}(1 - \hat{P}). \tag{7.28}$$

Having observed the values $x_1, \ldots, x_n$, we use the corresponding estimate

$$\hat{\sigma}^2 = \hat{p}(1 - \hat{p}) \tag{7.29}$$

of $\sigma^2$. Notice that this estimate is the same as that which we would get by replacing
$p$ in Equation 5.86 by $\hat{p}$.


2 Some people use a slightly different terminology. They call the random interval (A, B) a 
confidence interval and (a, b) its observed value. 
Now, if $n$ is large, then, by the de Moivre-Laplace theorem, the distribution of
$\hat{P} = \bar{X}_n$ is approximately normal, and so, for any $c > 0$, the interval

$$\left(\hat{p} - c\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},\ \hat{p} + c\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right) \tag{7.30}$$

from Example 7.1.7 is an approximate confidence interval for $p$ with confidence
level $\gamma = 2\Phi(c) - 1$, provided both endpoints lie between 0 and 1.

If one of the endpoints lies outside $[0, 1]$, then, since $p$ is a probability, we use
a one-sided confidence interval. For instance, if $\hat{p} + c\sqrt{\hat{p}(1 - \hat{p})/n} > 1$, then the
interval

$$\left(\hat{p} - c\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},\ 1\right) \tag{7.31}$$

is an approximate confidence interval for $p$ with confidence level $\gamma = 1 - \Phi(-c) = \Phi(c)$,
because, by the normal approximation,

$$P\left(\frac{p - \hat{P}}{\sqrt{p(1 - p)/n}} > -c\right) \approx 1 - \Phi(-c).$$


As can be seen from the general definition, a confidence interval does not have 
to be symmetric about the estimate. In the examples above, however, the symmetric 
confidence interval was the shortest one. On the other hand, in some applications, we 
are interested in one-sided confidence intervals, as in the example below. 


Example 7.1.9 (Voter Poll). Suppose a politician obtains a poll that shows that 52% 
of 400 likely voters, randomly selected from a much larger population, would vote 
for him. What confidence can he have that he would win the election, assuming that 
there are no changes in voter sentiment until election day? 

We need the same setup as in Example 7.1.8. We know the sample size $n$ and the
proportion $\hat{p}$ of favorable voters in the sample and want to find the confidence level
of the winning interval³ $(0.50, \infty)$ for the proportion $p$ of favorable voters in the
voting population.

3 Of course, a probability cannot be greater than 1, and so the upper limit of the interval
should be 1 rather than $\infty$, but the normal approximation gives only a minuscule probability
to the $(1, \infty)$ interval, and we may therefore ignore this issue.

The normal approximation gives

$$P\left(\frac{\bar{X}_n - p}{\sigma/\sqrt{n}} < c\right) = \Phi(c), \tag{7.32}$$

for any $c > 0$, or, equivalently,

$$P\left(\bar{X}_n - c\frac{\sigma}{\sqrt{n}} < p\right) = \Phi(c). \tag{7.33}$$

Thus, with $\hat{p} = \bar{x}_n$ and $\hat{\sigma}^2 = \hat{p}(1 - \hat{p})$, the interval $(\hat{p} - c\sqrt{\hat{p}(1 - \hat{p})/n}, \infty)$ is a
$\gamma = \Phi(c)$ level confidence interval for $p$.
From the given data, $\hat{p} = 0.52$ and $n = 400$ and so

$$\left(\hat{p} - c\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},\ \infty\right) = (0.52 - 0.02498c,\ \infty). \tag{7.34}$$

The politician wants to know the confidence level of the $(0.50, \infty)$ interval. Thus,
we need to solve $0.52 - 0.02498c = 0.50$, which results in $c \approx 0.80$ and $\gamma \approx \Phi(0.80) \approx 0.788$.
Thus, by this poll, he can have approximately 78.8% confidence
in winning the election.
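The arithmetic of Example 7.1.9 can be reproduced in a few lines. This is a sketch using `NormalDist` from Python's standard `statistics` module for $\Phi$; the function name is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_confidence_level(p_hat, n, threshold):
    """Level gamma = Phi(c) of the interval (threshold, infinity), where
    threshold = p_hat - c * sqrt(p_hat * (1 - p_hat) / n), as in Equation 7.34."""
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    c = (p_hat - threshold) / standard_error
    return c, NormalDist().cdf(c)

# Data of Example 7.1.9: 52% of 400 voters, winning threshold 0.50
c, gamma = one_sided_confidence_level(0.52, 400, 0.50)
print(round(c, 2), round(gamma, 3))
```

The same function gives the confidence level for any other threshold or sample size, which shows how quickly the level rises with $n$ when the observed proportion stays fixed.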


Exercises 


Exercise 7.1.1. In Equation 7.10 replace $\sigma^2$ by $v$ and differentiate with respect to $v$
to show that the MLE $\widehat{\sigma^2} = \hat{v}$ of the variance equals $\hat{\sigma}^2$.


Exercise 7.1.2. Find the MLE for the parameter $\lambda$ of an exponential r.v.


Exercise 7.1.3. Show that $\hat{\sigma}^2$, as given by Equation 7.14 for a normal r.v. $X$ and $n$
distinct values $x_1, \ldots, x_n$, equals the variance of a discrete r.v. $X^*$ with $n$ distinct,
equally likely possible values $x_1, \ldots, x_n$.


Exercise 7.1.4. Find the MLE $\hat{\lambda}$ for the parameter $\lambda$ of a Poisson r.v. (Note that this
MLE does not exist if all observed values equal 0.)


Exercise 7.1.5. Let $X$ be a continuous r.v. whose p.d.f., for $\lambda > 0$, is given by

$$f(x; \lambda) = \begin{cases} \lambda x^{\lambda - 1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases} \tag{7.35}$$

(a) Find the MLE for the parameter $\lambda$.
(b) Find an estimator for $\lambda$ by the method of moments. (Hint: First compute $E(X)$.)


Exercise 7.1.6. Let $X$ be uniform on the interval $[\theta_1, \theta_2]$. Find the MLE’s of $\theta_1$ and
$\theta_2$. (Hint: The extrema occur at the endpoints of an interval.)


Exercise 7.1.7. Let $X$ be uniform on the interval $(0, \theta)$. Show that

$$\hat{\Theta} = ((n + 1)/n)\max(X_1, \ldots, X_n)$$

is an unbiased estimator of $\theta$.
Exercise 7.1.8. A random sample of 50 cigarettes of a certain brand is tested for
nicotine content. The measurements result in a sample mean $\hat{\mu} = 20$ mg and sample
SD $\hat{\sigma} = 4$ mg. Find 90, 95 and 99% confidence intervals for the unknown mean
nicotine content $\mu$ of this brand, using the normal approximation.


Exercise 7.1.9. A random sample of 500 likely voters in a city is polled and 285 are
found to be Democrats. Find 90, 95 and 99% approximate confidence intervals for
the percentage of Democrats in the city.


Exercise 7.1.10. In a certain city, the mathematics SAT scores of a random sample
of 100 students are found to have mean $\hat{\mu}_1 = 520$ in 2002, and of another random
sample of 100 students, $\hat{\mu}_2 = 533$ in 2003, with the same SD $\hat{\sigma}_1 = \hat{\sigma}_2 = 60$ in
both years. The question is whether there is a real increase in the average score for
the whole city, or is the increase due only to chance fluctuation in the samples. Find
the confidence level of the one-sided confidence interval $(0, \infty)$ for the difference
$\mu = \mu_2 - \mu_1$. Use the normal approximation and $\hat{\sigma}^2 = \hat{\sigma}_1^2 + \hat{\sigma}_2^2$. What conclusions
can you draw from the result?


7.2 Testing Hypotheses 


In many applications, we do not need to estimate the value of a parameter $\theta$; we just
need to decide in which of two nonoverlapping sets, $\Omega_0$ or $\Omega_A$, it is likely to lie. The
assumption that it falls in $\Omega_0$ is called the null hypothesis $H_0$ and the assumption that
it falls in $\Omega_A$, the alternative hypothesis $H_A$.

Often, we want to test some treatment of the population under study, and then the
null hypothesis corresponds to the assumption that the treatment has no effect on the
value of the parameter, while the alternative hypothesis corresponds to the assumption
that the treatment has an effect. In other cases, we may compare two groups and
then the null hypothesis corresponds to the assumption that there is no difference
between certain parameters for the two groups, while the alternative hypothesis corresponds
to the assumption that there is a difference. Based on a test statistic $Y$ from
sample data, we wish to accept one of these hypotheses for the population(s), and
reject the other. In this section, we consider only one-point sets for $H_0$ of the form
$\Omega_0 = \{\theta_0\}$. Such a hypothesis is called simple. Any hypothesis that corresponds to
more than one $\theta$ value is called composite. $H_A$ is mostly considered to be composite,
with $\Omega_A$ of the form $\{\theta \mid \theta < \theta_0\}$, $\{\theta \mid \theta > \theta_0\}$ or $\{\theta \mid \theta \ne \theta_0\}$. The distribution of a
test statistic $Y$ under the assumption $H_0$ is called its null-distribution.


Example 7.2.1 (Cold Remedy). Suppose a drug company wants to test the effectiveness
of a proposed new drug for reducing the duration of the common cold. The drug
is given to $n = 100$ randomly selected patients at the onset of their symptoms. Suppose
that the length of the illness in untreated patients has mean $\mu_0 = 7$ days and SD
$\sigma = 1.5$ days. Let $\bar{X}$ denote the average length of the cold in a sample of 100 treated
patients and, say, we observe $\bar{x} = 5.2$ in the actual sample. This example fits in the
general scheme by the identifications $\theta = \mu$ and $Y = \bar{X}$.
The question is: Is this reduction just chance variation due to randomness in the
sample, or is it real, that is, due to the drug? In other words, we want to decide
whether the result $\bar{x} = 5.2$ is more likely to indicate that the sample comes from a
population with mean $\mu = 7$, or one with reduced mean $\mu < 7$. (One may think of
this population as the millions of possible users of this drug. Would they see a reduced
duration on average?) More precisely, assuming that $\bar{X}$ is normally distributed
(by the CLT, even if the individual durations are not) with SD $\sigma/\sqrt{n}$, but with an
unknown mean $\mu$, we want to decide, based on the observed value of the estimator
$\bar{X}$ of $\mu$, which of the hypotheses $\mu = \mu_0$ or $\mu < \mu_0$ to accept. The first of these
conditions is $H_0$ and the second one is $H_A$. (To be continued.)


Example 7.2.2 (Weight Reduction). We want to test the effectiveness of a new drug
for weight reduction and administer it, say, to a random sample of 36 adult women for
a month. Let $\bar{X}$ denote the average weight loss (as a positive value) of these women
from the beginning to the end of the month and $\hat{\Sigma}$ the sample SD of the weight
losses. Suppose we observe $\bar{x} = 1.5$ lbs. and $\hat{\sigma} = 4$ lbs. By the CLT we assume that
$\bar{X}$, the average weight loss of women in samples of size 36, is normally distributed
and we estimate $\sigma$ by the observed value of $\hat{\sigma} = 4$ lbs. The mean weight loss $\mu$
(of a hypothetical population, from which the sample is drawn) is unknown and we
want to decide whether the observed $\bar{X}$ value supports the null hypothesis $\mu = 0$,
that is, whether the observed average weight reduction is just chance variation due
to randomness in the sample, or it supports the alternative hypothesis $\mu > 0$, that the
reduction is real, i.e., caused by the drug. (To be continued.)


Now, how do we decide which hypothesis to accept and which to reject? We use
a test statistic $Y$, like $\bar{X}$ in the examples above, which is an estimator of the unknown
parameter $\theta$, and designate a set $C$ such that we reject $H_0$ and accept $H_A$ when the
observed value of $Y$ falls in $C$, and accept $H_0$ and reject $H_A$ otherwise. We allow no
third choice. This procedure is called a (statistical) test and the set $C$ is called the
rejection region (we reject $H_0$) or the critical region of the test.⁴

The set $C$ is usually taken to be an interval of the type $[c, \infty)$ or $(-\infty, c]$ or the
union of two such intervals. Which of these types of sets is used as $C$ is determined
by the alternative hypothesis. If $H_A$ is of the form $\mu < \mu_0$, as in Example 7.2.1, then
$H_A$ is supported by small values of $\bar{X}$, and so we take $C$ to be of the form $(-\infty, c]$.
On the other hand, as in Example 7.2.2, if $H_A$ is of the form $\mu > \mu_0$, then $H_A$ is
supported by large values of $\bar{X}$, and so we take $C$ to be of the form $[c, \infty)$. Finally,
if $H_A$ is of the form $\mu \ne \mu_0$, then we take $C = [\mu_0 + c, \infty) \cup (-\infty, \mu_0 - c]$. (In
this section we assume that $H_0$ is of the form $\mu = \mu_0$.)

To complete the description of a test, we still need to determine the value of the 
constant c in the definition of the critical region. We determine c from the probability 
of making the wrong decision. 

⁴ In some books the rejection region is defined to be the set in the $n$-dimensional space of
sample data that corresponds to $C$.


There are two types of wrong decisions that we can make: rejecting $H_0$ when it
is actually true, which is called an error of type 1, and accepting $H_0$ when $H_A$ is true,
which is called an error of type 2. In the examples above, a type 1 error would mean
that we accept an ineffective drug, while a type 2 error would mean that we reject
a good drug. The usual procedure is to prescribe a small value $\alpha$ for the probability
of an error of type 1 and devise a test, that is, a rejection region $C$, such that the
probability of $Y$ falling in $C$ is $\alpha$ if $H_0$ is true. The probability $\alpha$ of a type 1 error is
called the level of significance of our test. Thus


\[
\alpha = P(\bar X \in C \mid H_0) = P(\text{type 1 error}). \tag{7.36}
\]


$\alpha$ is traditionally set to be 5% or 1%, and then, knowing the distribution of $\bar X$
when $H_0$ is true, we use Equation 7.36 to determine the set $C$.

For $\alpha = 5\%$, we call the observed value of $Y$, if it does fall in $C$, statistically
significant, and if $\alpha$ is set at 1% and $Y$ is observed to fall in $C$, then we call the result
(that is, the observed value of $Y$, supporting $H_A$) highly significant.

In most cases, statistical tests are based on consideration of type 1 errors alone,
as described above. The reason for this is that we usually want to prove that a new
procedure or drug is effective and we publish or use it only if the statistical test rejects
$H_0$. But in that case we can commit only a type 1 error, i.e., we reject $H_0$ wrongly. In
fact, most medical or psychological journals will accept only statistically significant
results.

Nevertheless, in some situations we want or have to accept $H_0$, and then type 2
errors may arise. We shall discuss them in the next section.

We are now ready to set up the tests for our earlier examples. 


Example 7.2.3 (Cold Remedy, Continued). Our test statistic is $\bar X$, which we take to
be a normal r.v., since $n$ is sufficiently large for the CLT to apply. Because of this use
of the CLT, a test of this kind is called a large-sample Z-test.

We are interested in the probability of a type 1 error, that is, of wrongly rejecting
$H_0: \mu = 7$, that the drug is worthless, when it actually is worthless. So, we assume
that $H_0$ is true and take the parameters of the distribution of $\bar X$ to be $\mu_0$ and $\sigma/\sqrt{n}$,
which in this case are 7 and $1.5/\sqrt{100} = 0.15$, respectively. Since $H_A$ is of the form
$\mu < \mu_0$, we take the rejection region to be of the form $(-\infty, c]$, that is, we reject $H_0$
if $\bar X \le c$. We determine $c$ from the requirement


\[
P(\bar X \le c \mid H_0) = \alpha. \tag{7.37}
\]


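Setting $\alpha = 1\%$, we have

\[
P(\bar X \le c) = P\left(\frac{\bar X - 7}{0.15} \le \frac{c - 7}{0.15}\right) \approx \Phi\left(\frac{c - 7}{0.15}\right) = 0.01. \tag{7.38}
\]

Hence

\[
\frac{c - 7}{0.15} \approx \Phi^{-1}(0.01) \approx -2.33 \tag{7.39}
\]

and

\[
c \approx 7 - 2.33 \cdot 0.15 \approx 6.65. \tag{7.40}
\]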


Thus, the observed value $\bar x = 5.2$ is $< c$, and this result is highly significant. In
other words, the null hypothesis, that the result is due to chance, is rejected, and the
drug is declared effective. (Whether the reduction of the mean length of the illness
from 7 to 5.2 days is important or not is a different matter, about which statistical theory
has nothing to say. We must not mistake statistical significance for importance. The
terminology is misleading: A highly significant result may be quite unimportant; its
statistical significance just means that the effect is very likely real and not just due to
chance.)
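The computation of the critical value in this example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `lower_tail_critical_value` is made up here, and the normal quantile is taken from the standard library's `statistics.NormalDist` rather than from a table.

```python
from statistics import NormalDist

def lower_tail_critical_value(mu0, se, alpha):
    """Return c with P(Xbar <= c | H0) = alpha when Xbar ~ N(mu0, se^2)."""
    # c = mu0 + Phi^{-1}(alpha) * se; the quantile is negative for small alpha.
    return mu0 + NormalDist().inv_cdf(alpha) * se

# Numbers from the cold-remedy example: mu0 = 7, sigma = 1.5, n = 100.
se = 1.5 / 100 ** 0.5                      # SD of the sample mean, 0.15
c = lower_tail_critical_value(7.0, se, 0.01)
reject = 5.2 <= c                          # observed mean 5.2 lies in (-inf, c]
print(round(c, 2), reject)
```

The unrounded quantile is used here, so the result may differ from a table-based value in the third decimal.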


Example 7.2.4 (Weight Reduction, Continued). Our test statistic is again $\bar X$, which
we assume to be normal with mean $\mu_0 = 0$ and SD $\hat\sigma/\sqrt{n} = 4/6$. Since $H_A$ is of
the form $\mu > \mu_0$, we take the rejection region to be of the form $[c, \infty)$, that is, we
reject $H_0$ if $\bar X \ge c$. We determine $c$ from the requirement

\[
P(\bar X \ge c) = \alpha. \tag{7.41}
\]

Setting $\alpha = 1\%$, we have

\[
P(\bar X \ge c) = P\left(\frac{\bar X - 0}{0.667} \ge \frac{c - 0}{0.667}\right) \approx 1 - \Phi\left(\frac{c}{0.667}\right) = 0.01. \tag{7.42}
\]

Hence

\[
\frac{c}{0.667} \approx \Phi^{-1}(0.99) \approx 2.33 \tag{7.43}
\]

and

\[
c \approx 2.33 \cdot 0.667 \approx 1.55. \tag{7.44}
\]

Thus, the observed value $\bar x = 1.5$ is $< c$, and this result is not highly significant.
At this 1% level, we accept $H_0$.
On the other hand, setting $\alpha = 5\%$, we determine $c$ from

\[
P(\bar X \ge c) = P\left(\frac{\bar X - 0}{0.667} \ge \frac{c - 0}{0.667}\right) \approx 1 - \Phi\left(\frac{c}{0.667}\right) = 0.05 \tag{7.45}
\]

and we get

\[
\frac{c}{0.667} \approx \Phi^{-1}(0.95) \approx 1.645 \tag{7.46}
\]

and

\[
c \approx 1.645 \cdot 0.667 \approx 1.10. \tag{7.47}
\]


Thus, the observed value $\bar x = 1.5$ is $> c$, and so this result is significant, though,
as we have seen above, not highly significant.

In other words, the null hypothesis, that the result is due to chance, is rejected
at the $\alpha = 5\%$ level but accepted at the $\alpha = 1\%$ level. The drug may be declared
probably effective, but perhaps more testing, i.e., a larger sample, is required.
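The two upper-tail critical values of this example can be reproduced as follows. This is a minimal sketch; the helper name `upper_tail_critical_value` is an assumption made for illustration, and the SD of the sample mean is taken as 4/6, as in the example.

```python
from statistics import NormalDist

def upper_tail_critical_value(mu0, se, alpha):
    """Return c with P(Xbar >= c | H0) = alpha when Xbar ~ N(mu0, se^2)."""
    return mu0 + NormalDist().inv_cdf(1.0 - alpha) * se

se = 4 / 6          # SD of the sample mean used in the example
observed = 1.5      # observed mean weight reduction

# Critical value and decision at each significance level.
critical = {a: upper_tail_critical_value(0.0, se, a) for a in (0.01, 0.05)}
for a, c in critical.items():
    print(f"alpha={a}: c={c:.2f}, reject H0: {observed >= c}")
```

With these inputs the observation falls short of the 1% critical value but exceeds the 5% one, which matches the conclusion drawn in the text.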
Presenting the result of a test only as the rejection or acceptance of the null
hypothesis at a certain level of significance does not make full use of the information
available from the observed value of the test statistic. For instance, in this example,
the observed value $\bar x = 1.5$ was very close to the $c = 1.55$ value required for a highly
significant result, but this information is lost if we merely report what happens at the
1% and 5% levels. In order to convey the maximum amount of information available
from an observation, we usually report the lowest significance level at which the
observation would lead to a rejection of $H_0$. Thus, we report $P(\bar X \ge c \mid H_0)$ for the
observed value $c$ of $\bar X$. This probability is called the observed significance level or
P-value of the result. In this case, the observed value is 1.5, and so it is

\[
P(\bar X \ge 1.5) = P\left(\frac{\bar X - 0}{0.667} \ge \frac{1.5 - 0}{0.667}\right) \approx 1 - \Phi\left(\frac{1.5}{0.667}\right) \approx 0.0122. \tag{7.48}
\]


In general, we make the following definition: 


Definition 7.2.1 (P-Value). The observed significance level or P-value of a result
involving a test statistic $Y$ is defined as $P(Y \in C \mid H_0)$ with the critical region $C$ being
determined by the observed value $c$ of $Y$.


In Example 7.2.1, for instance, with $c = 5.2$, the P-value is

\[
P(\bar X \le 5.2) = P\left(\frac{\bar X - 7}{0.15} \le \frac{5.2 - 7}{0.15}\right) \approx \Phi(-12) \approx 0, \tag{7.49}
\]

which calls for the rejection of $H_0$ with virtual certainty, in contrast to the relatively
anemic value of roughly 1% obtained above in the weight reduction example.


We summarize the Z-test in the following definition: 


Definition 7.2.2 (Z-Test). We use this test for the unknown mean $\mu$ of a population
if we have (a) a random sample of any size from a normal distribution with known $\sigma$
or (b) a large random sample from any distribution so that $\bar X$ is nearly normal by the
CLT. The null hypothesis is $H_0: \mu = \mu_0$, where $\mu_0$ is the $\mu$-value we want to test
against one of the alternative hypotheses $H_A: \mu > \mu_0$, $\mu < \mu_0$, or $\mu \ne \mu_0$. The test
statistic is

\[
Z = \frac{\bar X - \mu_0}{\sigma/\sqrt{n}} \tag{7.50}
\]

in case (a), and

\[
Z = \frac{\bar X - \mu_0}{\hat\sigma/\sqrt{n}} \tag{7.51}
\]

in case (b), where $\hat\sigma$ is given by Equation 7.14. Let $z$ denote the observed value
of $Z$, i.e., the value computed from the actual sample as $z = (\bar x - \mu_0)/(\sigma/\sqrt{n})$ or
$z = (\bar x - \mu_0)/(\hat\sigma/\sqrt{n})$, where $\bar x$ is the observed value of the random variable $\bar X$.
Then, for $H_A: \mu < \mu_0$ the P-value is $\Phi(z)$, for $H_A: \mu > \mu_0$ the P-value is
$1 - \Phi(z)$, and for $H_A: \mu \ne \mu_0$ the P-value is $2(1 - \Phi(|z|))$. We reject $H_0$ if the
P-value is small and accept it otherwise.
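The three P-value formulas of the definition translate directly into code. The sketch below is illustrative only: the function name `z_test_p_value` and the string labels for the alternatives are assumptions of this example, not notation from the text.

```python
from statistics import NormalDist

def z_test_p_value(z, alternative):
    """P-value of a Z-test for the observed value z of Z.

    alternative: "less" (mu < mu0), "greater" (mu > mu0) or "two-sided" (mu != mu0).
    """
    phi = NormalDist().cdf
    if alternative == "less":
        return phi(z)
    if alternative == "greater":
        return 1.0 - phi(z)
    if alternative == "two-sided":
        return 2.0 * (1.0 - phi(abs(z)))
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")

# Cold-remedy example: observed mean 5.2, mu0 = 7, SD of the mean 0.15.
z_cold = (5.2 - 7.0) / 0.15                # about -12
p_cold = z_test_p_value(z_cold, "less")    # numerically indistinguishable from 0
print(z_cold, p_cold)
```

For a two-sided alternative the same function would be called as `z_test_p_value(z, "two-sided")`; an observed $z$ of 1.96, for instance, gives a P-value of about 0.05.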


Example 7.2.5 (Testing Fairness of a Coin; a Two-Tailed Test). Suppose we want to
test whether a certain coin is fair or not. We toss it $n = 100$ times and use the relative
frequency $\bar X$ of heads obtained as our test statistic. We want to find the rejection
region that results in a level of significance $\alpha = 0.05$.

$n\bar X$ is binomial, but by the CLT we can approximate the distribution of $\bar X$ with a
normal distribution having parameters $\mu = p = P(H)$ and $\sigma = \sqrt{pq/100}$. The
hypotheses we want to test are $H_0: p = 0.5$ and $H_A: p \ne 0.5$. Thus, for $H_0$,
$\sigma = \sqrt{(1/2) \cdot (1/2) \cdot (1/100)} = 0.05$. The rejection region should be of the form
$C = (-\infty, 0.5 - c] \cup [0.5 + c, \infty)$, that is, the complement of the interval
$(0.5 - c, 0.5 + c)$. The requirement $\alpha = 0.05$ translates into finding $c$ such that

\[
P(C \mid H_0) = P(|\bar X - 0.5| \ge c) = P\left(\frac{|\bar X - 0.5|}{0.05} \ge \frac{c}{0.05}\right)
\approx P\left(|Z| \ge \frac{c}{0.05}\right) = 2\left(1 - \Phi\left(\frac{c}{0.05}\right)\right) = 0.05 \tag{7.52}
\]

or

\[
\Phi\left(\frac{c}{0.05}\right) = 0.975. \tag{7.53}
\]

Hence $c/0.05 \approx 1.96$ and $c \approx 0.098$. So, we accept $H_0$, that is, declare the coin fair,
if $\bar X$ falls in the interval (0.402, 0.598), and reject $H_0$ otherwise.
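As an illustration, the half-width $c$ of the acceptance interval can be computed directly from the normal quantile. The names below (`reject_fair_coin`, `lower`, `upper`) are hypothetical and chosen only for this sketch.

```python
from statistics import NormalDist

n, p0, alpha = 100, 0.5, 0.05
se = (p0 * (1 - p0) / n) ** 0.5                 # SD of the relative frequency under H0
c = NormalDist().inv_cdf(1 - alpha / 2) * se    # Phi(c / se) = 1 - alpha/2
lower, upper = p0 - c, p0 + c                   # H0 is accepted strictly between these

def reject_fair_coin(rel_freq):
    """Two-tailed decision: reject H0 when |rel_freq - p0| >= c."""
    return abs(rel_freq - p0) >= c

print(round(lower, 3), round(upper, 3))
```

So, for example, 61 heads in 100 tosses would lead to rejection, while 55 heads would not.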


In many statistical tests, we have to use distributions other than the normal, as in 
the next example. 


Example 7.2.6 (Sex Bias in a Jury). Suppose the 12 members of a jury are randomly
selected from a large pool of potential jurors consisting of an equal number of men
and women, and the jury ends up with 3 women and 9 men. We wish to test the
hypothesis $H_0$ that the probability $p$ of selecting a woman is $p_0 = 1/2$, versus
the alternative $H_A$ that $p < 1/2$. Note that $H_0$ means that the jury is randomly
selected from the general population, about half of which consists of women, and
$H_A$ means that the selection is done from a subpopulation from which some women
are excluded.

The test statistic we use is the number $X$ of women in the jury. This $X$ is binomial
and, under the assumption $H_0$, it has parameters $n = 12$ and $p_0 = 1/2$. The rejection
region is of the form $\{x \le c\}$ and, to obtain the P-value for the actual jury, we must
use $c = 3$. Thus, the P-value is

\[
P(X \le 3 \mid H_0) = \sum_{k=0}^{3} \binom{12}{k} \left(\frac{1}{2}\right)^{12} \approx 0.073. \tag{7.54}
\]

So, although the probability is low, it is possible that there was no sex bias in this
jury selection and we accept the null hypothesis. To be more certain, one way or the
other, we would have to examine more juries selected by the same process.
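The binomial sum in Equation 7.54 is easy to evaluate exactly. The following sketch uses a helper named `binomial_lower_tail`, which is an assumed name for this illustration.

```python
from math import comb

def binomial_lower_tail(n, p, c):
    """P(X <= c) for a binomial X with parameters n and p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(c + 1))

# Jury example: n = 12 jurors, p0 = 1/2 under H0, 3 women observed.
p_value = binomial_lower_tail(12, 0.5, 3)
print(round(p_value, 3))
```

Here the sum is $(1 + 12 + 66 + 220)/4096 = 299/4096$, in agreement with the value 0.073 quoted above.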
In cases where the sample is small, the distribution is unknown and the evidence
seems to point very strongly against the null hypothesis, we may use Chebyshev’s
inequality to estimate the P-value, as in the next example.


Example 7.2.7 (Age of First Marriage in Ancient Rome). Lelis, Percy and Verstraete⁵
studied the ages of Roman historical figures at the time of their first marriage. They
did this to refute earlier improbably high age estimates that were based on funerary 
inscriptions. Others had found that for women, the epitaphs were written by their 
fathers up to an average age of 19 and after that by their husbands, and jumped to the 
conclusion that women first married at an average age of 19. (A similar estimate of 
26 was obtained for men.) 

From the historical record, the ages at first marriage of 26 women were 11, 12,
12, 12, 12, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 16, 16,
17, 17.

The mean of these numbers is 14.0 and the standard deviation is 1.57.

A random sample of size 26 is just barely large enough to assume that the average
is normally distributed, here with standard deviation $1.57/\sqrt{26} \approx 0.31$. Nevertheless,
we first assume this, but then obtain another estimate without this assumption as
well.

This sample, however, is a sample of convenience. We may assume though that 
it is close to a random sample, at least from the population of upper class women. 
We also assume that marriage customs remained steady during the centuries covered. 
(For this reason, we omitted three women for whom records were available from the 
Christian era.) 

We take the null hypothesis to be that the average is 19, and the alternative
hypothesis to be that it is less. With the above assumptions, we can compute the P-value,
that is, the probability that the mean in the sample turns out to be 14 or less if the
population mean is 19, as

\[
P(\bar X \le 14) = P\left(\frac{\bar X - 19}{0.31} \le \frac{14 - 19}{0.31}\right) \approx \Phi\left(\frac{14 - 19}{0.31}\right) \approx \Phi(-16) \approx 0.
\]

Thus, the null hypothesis must be rejected with practical certainty, unless the
assumptions can be shown to be invalid.

The ridiculously low number we obtained depends heavily on the validity of the
normal approximation, which is questionable. We can avoid it and compute an estimate
for the P-value by using Chebyshev’s inequality (see Theorem 5.2.6) instead,
which is valid for any distribution. Using the latter, we have $P(|\bar X_n - \mu| \ge \varepsilon) =
P(|\bar X_n - 19| \ge 5) \le \sigma^2/(n\varepsilon^2) \approx 1.57^2/(26 \cdot 5^2) \approx 3.8 \times 10^{-3}$. Since
$P(\bar X_n \le 14) \le P(|\bar X_n - 19| \ge 5)$, this is also a bound for the P-value. This estimate, though
very crude (in the sense that the true P-value is probably much lower), is much more
reliable than the one above, and it is still sufficiently small to enable us to conclude
that the null hypothesis, of an average age 19 at first marriage, is untenable.


⁵ A. A. Lelis, W. A. Percy and B. C. Verstraete, The Age of Marriage in Ancient Rome (The
Edwin Mellen Press, 2003).


So, how can one explain the evidence of the tombstones? Apparently, people
were commemorated by their fathers if possible, whether they were married or not at
the time of their deaths, and only after the death of the father (who often died fairly
young) did this duty fall to the spouse.
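The numbers in this example can be rechecked from the raw data. The sketch below assumes that the quoted SD of 1.57 is the sample standard deviation with divisor $n - 1$ (which is what `statistics.stdev` computes); the variable names are chosen here for illustration.

```python
from statistics import NormalDist, mean, stdev

# Ages at first marriage of the 26 women listed in the example.
ages = [11] + [12] * 4 + [13] * 5 + [14] * 6 + [15] * 6 + [16] * 2 + [17] * 2

n = len(ages)
xbar = mean(ages)
s = stdev(ages)                 # divisor n - 1; gives about 1.57
se = s / n ** 0.5               # SD of the sample mean, about 0.31

# P-value under the normal approximation, H0: mu = 19, HA: mu < 19.
p_normal = NormalDist().cdf((xbar - 19) / se)

# Chebyshev bound: P(|Xbar - 19| >= eps) <= s^2 / (n * eps^2), with eps = 19 - xbar.
eps = 19 - xbar
p_chebyshev = s ** 2 / (n * eps ** 2)

print(n, xbar, round(s, 2), round(se, 2), p_normal, round(p_chebyshev, 4))
```

The Chebyshev bound makes no distributional assumption, which is why it is so much larger than the normal-approximation value and still small enough to reject the null hypothesis.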


In many applications we analyze the difference of paired observations. For in- 
stance, the difference in the blood pressure of people before and after administering 
a drug can be used to test the effectiveness of the drug. Similarly, differences in twins 
(both people and animals) are often used to investigate effects of drugs, when one 
twin is treated and the other is not. The genetic similarity of the twins ensures that 
the observed effect is primarily due to the drug and not to other factors. 


Example 7.2.8 (Smoking and Bone Density). The effect of smoking on bone density
was investigated by studying pairs of twin women.⁶ A reduction in bone density
is an indicator of osteoporosis, a serious disease mainly of elderly women, which
frequently results in bone fractures. Among other results, the bone density of the
lumbar spine of 41 twin pairs was measured, the twins of each pair differing by 5
or more pack-years of smoking. (Pack-years of smoking was defined as the lifetime
tobacco use, calculated by the number of years smoked times the average number of
cigarettes smoked per day, divided by 20.) The following mean bone densities were
obtained (SE means the SD of the mean):


              Lighter smoker    Heavier smoker    Difference
              (g/cm²)           (g/cm²)

Mean ± SE     0.795 ± 0.020     0.759 ± 0.021     0.036 ± 0.014


The null hypothesis was that the mean bone densities $\mu_2$ and $\mu_1$ of the two
populations, the heavier and the lighter smokers, are equal, that is, that $\mu = \mu_1 - \mu_2 = 0$,
and the alternative that $\mu > 0$. The test statistic is the mean difference in the sample,
which is large enough for the normal approximation to apply. Thus, the observed
z-value is $z = 0.036/0.014 \approx 2.57$, and so the P-value is $P(\bar X_1 - \bar X_2 \ge 0.036) \approx
1 - \Phi(2.57) \approx 0.005$, a highly significant result. Apparently, smoking does cause
osteoporosis.
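For a paired comparison like this one, the whole test reduces to a one-sample Z-test on the differences. A minimal sketch with the two summary numbers from the table (the variable names are assumptions of this illustration):

```python
from statistics import NormalDist

mean_diff = 0.036   # mean within-pair difference, lighter minus heavier smoker
se_diff = 0.014     # SE of that mean difference, as reported in the table

z = mean_diff / se_diff                 # observed z-value under H0: mu = 0
p_value = 1.0 - NormalDist().cdf(z)     # upper-tail P-value for HA: mu > 0
print(round(z, 2), round(p_value, 3))
```

Note that the SE of the difference (0.014) is smaller than either group's SE, which reflects the gain from pairing: the test uses the variability of the differences, not of the two groups separately.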


We shall return to hypothesis testing with different parameters and distributions 
in later sections. 


Exercises 


In all questions below, formulate a null and an alternative hypothesis for a population 
parameter, set up a test statistic and a rejection region, compute the P-value, and 
draw a conclusion whether to accept or reject the null hypothesis. Use the normal 
approximation in each case. 


⁶ J. L. Hopper and E. Seeman, The Bone Density of Female Twins Discordant for Tobacco
Use. NEJM, Feb. 14, 1994.




Exercise 7.2.1. At a certain school there are many sections of calculus classes. On 
the common final exam, the average grade is 66 and the SD is 24. In a section of 32 
students (who were randomly assigned to this section), the average turns out to be 
only 53. Is this explainable by chance or does this class likely come from a population 
with a lower mean, due to some real effect, like bad teaching, illness or drug use? 


Exercise 7.2.2. On a large farm the cows weigh on the average 520 kilograms. A 
special diet is tried for 50 randomly selected cows and their weight is observed to 
have an average of 528 kilograms and an SD of 25 kilograms. Is the diet effective? 


Exercise 7.2.3. Assume that a special diet is tried for 50 randomly selected cows 
and their weight is observed to increase an average of 10 kilograms with an SD of 
20 kilograms. Is the diet effective? 


Exercise 7.2.4. In a certain large town 10% of the population is black and 90% is 
white. A jury pool of 50, supposedly randomly selected people, turns out to be all 
white. Is there evidence of racial discrimination here? 


7.3 The Power Function of a Test 


In the preceding section we discussed type 1 errors, that is, errors committed when
$H_0$ is true but is erroneously rejected. Here we are going to consider errors of type
2, that is, errors committed when $H_0$ is erroneously accepted although $H_A$ is true.
Whenever we accept $H_0$, we should consider the possibility of a type 2 error.

Since $H_A$ is usually composite, that is, it corresponds to more than just a single
value of the parameter, we cannot compute the probability of a type 2 error without
specifying for which value of $\theta$ in $\Theta_A$ this probability $\beta(\theta)$ is computed. Thus, with
$Y$ denoting the test statistic and $C$ its critical region (where we reject $H_0$),

\[
\beta(\theta) = P(\text{type 2 error} \mid \theta \in \Theta_A) = P(Y \in \bar C \mid \theta \in \Theta_A), \tag{7.55}
\]

where $\bar C$ denotes the complement of $C$. $\beta(\theta)$ is sometimes called the size of a type 2
error for the given value $\theta$.


Definition 7.3.1 (Power Function). The power function of a test is the function
given by

\[
\pi(\theta) = P(Y \in C \mid \theta) \quad \text{for } \theta \in \Theta_0 \cup \Theta_A, \tag{7.56}
\]

and the function given by $1 - \pi(\theta)$ is called the operating characteristic function of the test.


The reason for the name “power function” is that for $\theta \in \Theta_A$, the value of the
function measures how likely it is that we reject $H_0$ when it should indeed be rejected,
i.e., how powerful the test is for such $\theta$.

The name “operating characteristic function” comes from applying such tests to
acceptance sampling, that is, to deciding whether to accept a lot of certain manufactured
items, by counting the number $x$ of nondefectives in the sample (see Examples
5.1.14 and 5.1.13), and accepting the lot if $x$ is greater than some prescribed value,
and rejecting it otherwise. Accepting the lot corresponds to accepting $H_0$ that the
manufacturing process operates well enough, and $1 - \pi(\theta) = P(Y \in \bar C \mid \theta)$, with $\bar C$
the complement of $C$, is its probability. For $\theta \in \Theta_A$, $1 - \pi(\theta) = \beta(\theta)$.

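Clearly, if $H_0$ is simple, that is, $\Theta_0 = \{\theta_0\}$, then

\[
\pi(\theta) = \begin{cases} \alpha & \text{if } \theta = \theta_0, \\ 1 - \beta(\theta) & \text{if } \theta \in \Theta_A. \end{cases} \tag{7.57}
\]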


Example 7.3.1 (Cold Remedy, Continued). Let us determine the power function for
the test discussed in Examples 7.2.1 and 7.2.3. In these examples $n = 100$, $\theta_0 =
\mu_0 = 7$, and $Y = \bar X$, which is approximately normally distributed with parameters
$\theta = \mu$ and SD $= 0.15$. For $\alpha = 0.01$, we obtained the rejection region $C = \{\bar x :
\bar x \le 6.65\}$. Thus,

\[
\pi(\mu) = P(\bar X \in C \mid \mu) = P(\bar X \le 6.65 \mid \mu)
= P\left(\frac{\bar X - \mu}{0.15} \le \frac{6.65 - \mu}{0.15}\right) \approx \Phi\left(\frac{6.65 - \mu}{0.15}\right) \quad \text{for } \mu \le 7, \tag{7.58}
\]

and the graph of this power function is given by Figure 7.2.

Fig. 7.2. Graph of $y = \pi(\mu)$.

Let us examine a few values of $\pi(\mu)$ as shown in the graph.

For $\mu = 7$, $\pi(7) \approx 0.01 = \alpha = P(\text{type 1 error})$ = probability of accepting a
worthless drug as effective.

At $\mu = 6.65$, the boundary of the rejection region, $\pi(6.65) = 0.5$, which is
reasonable, since a slightly higher $\mu$ would lead to an incorrect acceptance of $H_0$,
and a slightly lower $\mu$ to a correct rejection of $H_0$. Thus, at $\mu = 6.65$ we are just as
likely to make a correct decision as an incorrect one.

For $\mu$-values between 6.65 and 7 the probability of (an incorrect) rejection of $H_0$
decreases as it should, because $\mu$ is getting closer to $\mu_0 = 7$.

For $\mu \le 6.2$, $\pi(\mu)$ is almost 1. Apparently, 6.2 is sufficiently far from $\mu_0 = 7$,
so that the test (correctly) rejects $H_0$ with virtual certainty.

For $\mu$-values between 6.2 and 6.65 the probability of (a correct) rejection of $H_0$
decreases from 1 to 0.5, because the closer the true value of $\mu$ is to 6.65, the less
likely it becomes that the test will reject $H_0$.

Note that we could extend $\pi(\mu)$ to $\mu$-values greater than 7, but doing so would
make sense only if we changed $H_0$ from $\mu = 7$ to $\mu \ge 7$. For this changed $H_0$ we
would have $\pi(\mu) \le 0.01 = \alpha$ for $\mu \ge 7$.
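The values of the power function discussed in this example can be tabulated with a short script. This is a sketch under the example's assumptions (rejection region $\bar X \le 6.65$, SD of the mean 0.15); the function name `power_lower_tail` is hypothetical.

```python
from statistics import NormalDist

def power_lower_tail(mu, c=6.65, se=0.15):
    """pi(mu) = P(Xbar <= c | mu) for Xbar ~ N(mu, se^2)."""
    return NormalDist().cdf((c - mu) / se)

# Tabulate the power function at a few values of the true mean.
for mu in (7.0, 6.8, 6.65, 6.4, 6.2):
    print(f"mu={mu}: pi={power_lower_tail(mu):.4f}")
```

The table shows the pattern described above: the power is about 0.01 at $\mu = 7$, exactly 0.5 at the boundary 6.65, and very close to 1 at 6.2.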


Notice that the rejection region $C$ and the power function $\pi(\theta)$ do not depend on
the exact form of $H_0$ and $H_A$. For example, the same $C$ and $\pi(\theta)$ that we had in the
example above could describe a test for deciding between $H_0: 6.9 \le \mu \le 7$ and $H_A$:
$\mu < 6.9$ as well.

In general, whether $H_0$ is composite or not, we define the size of the test to be

\[
\alpha = \sup_{\theta \in \Theta_0} \pi(\theta) = \operatorname{lub} P(\text{type 1 error}). \tag{7.59}
\]

In the case of a simple $H_0$, that is, for $H_0: \theta = \theta_0$, this definition reduces to $\alpha =
\pi(\theta_0)$.


Example 7.3.2 (Testing Fairness of a Coin, Continued). Here we continue Example
7.2.5. We test whether a certain coin is fair or not. We toss it $n = 100$ times and
use the relative frequency $\bar X$ of heads obtained as our test statistic with the normal
approximation, to test the value of the parameter $\theta = p = P(H)$. We obtained
$C = (-\infty, 0.402] \cup [0.598, \infty)$, the complement of the interval (0.402, 0.598), as the
rejection region for $\alpha = 0.05$. Now we want to find the power function for this test.

By definition $\pi(p) = P(\bar X \in C \mid p)$, and so

\[
\begin{aligned}
\pi(p) &= P(\bar X \le 0.402 \mid p) + P(\bar X \ge 0.598 \mid p) \\
&= P\left(\frac{\bar X - p}{\sqrt{p(1-p)/100}} \le \frac{0.402 - p}{\sqrt{p(1-p)/100}}\right)
 + P\left(\frac{\bar X - p}{\sqrt{p(1-p)/100}} \ge \frac{0.598 - p}{\sqrt{p(1-p)/100}}\right) \\
&\approx \Phi\left(\frac{0.402 - p}{\sqrt{p(1-p)/100}}\right) + 1 - \Phi\left(\frac{0.598 - p}{\sqrt{p(1-p)/100}}\right).
\end{aligned} \tag{7.60}
\]

As can be seen in Figure 7.3, $\pi(p)$ has its minimum $\alpha = 0.05$ at $p = 1/2$ and
equals approximately 0.5 at the boundary points $p = 0.402$ and $p = 0.598$ of the rejection region.
For $p \le 0.3$ and $p \ge 0.7$ a correct rejection of $H_0$ occurs with probability practically
1, that is, the probability $\beta(p) = 1 - \pi(p)$ of a type 2 error is near zero there.
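Equation 7.60 can be evaluated numerically to check the features of the graph mentioned above. The function name `coin_power` and its default arguments are assumptions made for this sketch; the normal approximation is the one used in the example.

```python
from statistics import NormalDist

def coin_power(p, n=100, lo=0.402, hi=0.598):
    """pi(p) = P(Xbar <= lo | p) + P(Xbar >= hi | p), normal approximation."""
    se = (p * (1 - p) / n) ** 0.5          # SD of the relative frequency for this p
    phi = NormalDist().cdf
    return phi((lo - p) / se) + 1.0 - phi((hi - p) / se)

# Tabulate the power function on both sides of p = 0.5.
for p in (0.3, 0.402, 0.45, 0.5, 0.55, 0.598, 0.7):
    print(f"p={p}: pi={coin_power(p):.4f}")
```

Evaluating $1 - \pi(p)$ at a particular alternative, such as $p = 0.55$, gives the probability of a type 2 error there.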


Fig. 7.3. Graph of $y = \pi(p)$.


When we design a test, we want to make the probabilities of both types of errors
small, that is, we want a power function that is small on $\Theta_0$ and large on $\Theta_A$.
Generally, we have a choice of the values of two variables: the sample size $n$ and
the boundary value $c$ of the rejection region, and so, if we do not fix $n$ in advance
as in the preceding examples, then we can prescribe the size $\beta(\theta)$ of the type 2 error
at some point $\theta$ of $\Theta_A$ in addition to prescribing $\alpha$. This procedure is
illustrated in the next example.


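Example 7.3.3 (Cold Remedy, Again). As in Example 7.3.1, we assume $\theta_0 = \mu_0 = 7$,
and an approximately normal $Y = \bar X$ with mean $\theta = \mu$ but SD $= 1.5/\sqrt{n}$. With the
rejection region of the form $C = (-\infty, c)$, we want to determine $c$ and $n$ such that
$\alpha = 0.01$ and $\beta(6) = 0.01$ as well. These conditions amount to

\[
P(\bar X < c \mid \mu = 7) = P\left(\frac{\bar X - 7}{1.5/\sqrt{n}} < \frac{c - 7}{1.5/\sqrt{n}}\right) \approx \Phi\left(\frac{c - 7}{1.5/\sqrt{n}}\right) = 0.01 \tag{7.61}
\]

and

\[
P(\bar X \ge c \mid \mu = 6) = P\left(\frac{\bar X - 6}{1.5/\sqrt{n}} \ge \frac{c - 6}{1.5/\sqrt{n}}\right) \approx 1 - \Phi\left(\frac{c - 6}{1.5/\sqrt{n}}\right) = 0.01. \tag{7.62}
\]

Hence

\[
\frac{c - 7}{1.5/\sqrt{n}} = \Phi^{-1}(0.01) = -2.3263 \tag{7.63}
\]

and

\[
\frac{c - 6}{1.5/\sqrt{n}} = \Phi^{-1}(0.99) = 2.3263. \tag{7.64}
\]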


These two equations are solved (approximately) by $c = 6.5$ and $n = 49$, which yield
the power function

\[
\pi(\mu) = P(\bar X \in C \mid \mu) = P(\bar X < 6.5 \mid \mu)
= P\left(\frac{\bar X - \mu}{1.5/7} < \frac{6.5 - \mu}{1.5/7}\right) \approx \Phi\left(\frac{6.5 - \mu}{1.5/7}\right) \quad \text{for } \mu \le 7, \tag{7.65}
\]

whose graph is shown in Figure 7.4.

Fig. 7.4. Graph of $y = \pi(\mu)$.

This graph is much flatter than Figure 7.2, because here we are satisfied with
less accuracy than in Example 7.3.1. Here we required $\beta(6) = 0.01$, but in Example
7.3.1 we had $\beta(6) \approx 10^{-5}$. On the other hand, in the present case we can get away
with a smaller sample, which is often a useful advantage.
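The pair of equations 7.63 and 7.64 can be solved in closed form: subtracting them eliminates $c$ and gives $\sqrt{n}$, after which $c$ follows from either equation. A minimal sketch, with the function name `design_lower_tail_test` chosen here for illustration:

```python
from math import ceil
from statistics import NormalDist

def design_lower_tail_test(mu0, mu1, sigma, alpha, beta):
    """Find c and n for the rejection region Xbar < c, with size alpha at mu0
    and type 2 error probability beta at mu1 < mu0 (normal approximation)."""
    z_a = NormalDist().inv_cdf(alpha)         # (c - mu0) / (sigma / sqrt(n)), negative
    z_b = NormalDist().inv_cdf(1.0 - beta)    # (c - mu1) / (sigma / sqrt(n)), positive
    # Subtracting the two equations: mu0 - mu1 = (z_b - z_a) * sigma / sqrt(n).
    sqrt_n = (z_b - z_a) * sigma / (mu0 - mu1)
    c = mu0 + z_a * sigma / sqrt_n
    return c, sqrt_n ** 2

c, n_exact = design_lower_tail_test(7.0, 6.0, 1.5, 0.01, 0.01)
n = ceil(n_exact)        # round up so that both error bounds are still met
print(round(c, 3), round(n_exact, 2), n)
```

Because $\alpha$ and $\beta$ are equal here, $c$ lands exactly halfway between 6 and 7; with unequal error probabilities it would shift toward the hypothesis with the smaller allowed error.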


Exercises 


Exercise 7.3.1. (a) In Example 7.3.1 what is the meaning of a type 2 error?
(b) What is the probability that we accept the drug as effective if $\mu = 6.5$?


Exercise 7.3.2. (a) In Example 7.3.2 what is the meaning of a type 2 error?
(b) What is the probability that we accept the coin as fair if $p = 0.55$?


Exercise 7.3.3. As in Exercise 7.2.1, consider a large school where there are many
sections of calculus classes and on the common final exam, the average grade is 66,
the SD is 24 and a certain section has 32 students. We want to test whether the given
section comes from the same population or one with a lower average but with the
same SD, that is, test $H_0: \mu = 66$ against $H_A: \mu < 66$.

(a) Find the rejection region that results in a level of significance $\alpha = 0.05$.

(b) Find and plot the power function for this test.


Exercise 7.3.4. As in Exercise 7.2.2, consider a special diet for $n$ cows randomly
selected from a population of cows weighing on average 500 kilograms with an SD
of 25 kilograms. Find the critical region and the sample size $n$ for a test, in terms
of the average weight $\bar X$ of the cows in the sample, to measure the effectiveness of
the diet, by deciding between $H_0: \mu = 500$ and $H_A: \mu > 500$, with a level of
significance $\alpha = 0.05$ and $\beta(515) = 0.05$. Find and plot the power function for this
test, using the normal approximation.


Exercise 7.3.5. Suppose a customer wants to buy a large lot of computer memory 
chips and tests a random sample of n = 12 of them. He rejects the lot if there is more 
than one defective chip in the sample, and accepts it otherwise. Use the binomial 
distribution to find and plot the operating characteristic function of this test as a 
function of the probability p of a chip being nondefective. 


7.4 Sampling from Normally Distributed Populations 


As mentioned before, in real life many populations have a normal or close to normal 
distribution. Consequently, statistical methods devised for such populations are very 
important in applications. 

In Corollary 6.2.3 we saw that the sample mean of a normal population is normally
distributed and in Example 7.1.7 we gave confidence intervals based on $\bar X$ for
an unknown $\mu$ when $\sigma$ was known.

Here we shall discuss sampling when both $\mu$ and $\sigma$ are unknown. In this case,
we use the M.L.E. estimators $\bar X$ and $\hat\Sigma^2$ for $\mu$ and $\sigma^2$ (see Example 7.1.3) and first
want to prove that, surprisingly, they are independent of each other, in spite of the
fact that they are both functions of the same r.v.’s $X_i$.

Before proving this theorem, we present two lemmas.


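Lemma 7.4.1. For a random sample from an $N(\mu, \sigma^2)$ distribution, $X_i - \bar X$ and $\bar X$
are uncorrelated.


Proof. From Corollary 6.2.3 we know that $E(\bar X) = \mu$, and so $E(X_i - \bar X) = \mu - \mu =
0$. Let us change over to the new variables $Y_i = X_i - \mu$. Then $\bar Y = (1/n) \sum_{i=1}^{n} Y_i =
\bar X - \mu$ and

\[
\mathrm{Cov}(X_i - \bar X, \bar X) = E\left((X_i - \bar X)(\bar X - \mu)\right) = E\left((Y_i - \bar Y)\bar Y\right)
= E(Y_i \bar Y) - E(\bar Y^2). \tag{7.66}
\]

Now, $E(Y_i Y_j) = E(Y_i)E(Y_j) = 0$ if $i \ne j$, and $E(Y_i^2) = \sigma^2$. Thus,

\[
E(Y_i \bar Y) = \frac{1}{n} \sum_{j=1}^{n} E(Y_i Y_j) = \frac{1}{n} E(Y_i^2) = \frac{\sigma^2}{n}. \tag{7.67}
\]

Also, from Corollary 6.2.3,

\[
E(\bar Y^2) = \mathrm{Var}(\bar X) = \frac{\sigma^2}{n} \tag{7.68}
\]

and therefore

\[
\mathrm{Cov}(X_i - \bar X, \bar X) = 0. \tag{7.69}
\]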


Lemma 7.4.2. If, for any integer $n > 1$, $(X_1, X_2, \ldots, X_n)$ form a multivariate
normal $n$-tuple and $\mathrm{Cov}(X_i, X_n) = 0$ for all $i \ne n$, then $X_n$ is independent of
$(X_1, X_2, \ldots, X_{n-1})$.


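Proof. The proof will be similar to that of Theorem 6.5.1. We are going to use the
multivariate moment generating function

\[
\psi_{1,2,\ldots,n}(s_1, s_2, \ldots, s_n) = E\left(\exp\left(\sum_{i=1}^{n} s_i X_i\right)\right). \tag{7.70}
\]

By Theorem 6.2.6, $Y = \sum_{i=1}^{n} s_i X_i$ is normal, because it is a linear combination
of the original, independent, random variables $Z_i$. Clearly, it has mean

\[
\mu_Y = \sum_{i=1}^{n} s_i \mu_i \tag{7.71}
\]

and variance

\[
\begin{aligned}
\sigma_Y^2 &= E\left(\left(\sum_{i=1}^{n} s_i (X_i - \mu_i)\right)^2\right)
= E\left(\left(\sum_{i=1}^{n} s_i (X_i - \mu_i)\right)\left(\sum_{j=1}^{n} s_j (X_j - \mu_j)\right)\right) \\
&= E\left(\sum_{i=1}^{n} \sum_{j=1}^{n} s_i s_j (X_i - \mu_i)(X_j - \mu_j)\right)
= \sum_{i=1}^{n} \sum_{j=1}^{n} s_i s_j \sigma_{ij}.
\end{aligned} \tag{7.72}
\]

Here $\sigma_{ij} = \mathrm{Cov}(X_i, X_j)$ if $i \ne j$, and $\sigma_{ii} = \mathrm{Var}(X_i)$. Thus, by the definition of the
m.g.f. of $Y$ as $\psi_Y(t) = E(e^{tY})$ and by Equation 6.46,

\[
\psi_{1,2,\ldots,n}(s_1, s_2, \ldots, s_n) = \psi_Y(1) = \exp\left(\sum_{i=1}^{n} s_i \mu_i + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} s_i s_j \sigma_{ij}\right). \tag{7.73}
\]

Now, separating the terms with a subscript $n$ from the others, we get

\[
\psi_{1,2,\ldots,n}(s_1, s_2, \ldots, s_n) = \exp\left(\sum_{i=1}^{n-1} s_i \mu_i + \frac{1}{2} \sum_{i=1}^{n-1} \sum_{j=1}^{n-1} s_i s_j \sigma_{ij} + s_n \mu_n + \frac{1}{2} s_n^2 \sigma_{nn}\right), \tag{7.74}
\]

because we assumed $\sigma_{in} = \sigma_{ni} = 0$. Hence $\psi_{1,2,\ldots,n}(s_1, s_2, \ldots, s_n)$ factors as

\[
\exp\left(\sum_{i=1}^{n-1} s_i \mu_i + \frac{1}{2} \sum_{i=1}^{n-1} \sum_{j=1}^{n-1} s_i s_j \sigma_{ij}\right) \exp\left(s_n \mu_n + \frac{1}{2} s_n^2 \sigma_{nn}\right)
= \psi_{1,2,\ldots,n-1}(s_1, s_2, \ldots, s_{n-1})\, \psi_n(s_n), \tag{7.75}
\]

which is the product of the moment generating functions of $(X_1, X_2, \ldots, X_{n-1})$ and
$X_n$.

Now, if $(X_1, X_2, \ldots, X_{n-1})$ and $X_n$ are independent, then their joint m.g.f. factors
into precisely the same product. So, by the uniqueness of moment generating
functions, which holds in the $n$-dimensional case as well, $(X_1, X_2, \ldots, X_{n-1})$ and
$X_n$ must be independent if $\sigma_{in} = 0$ for all $i \ne n$. ∎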


We are now ready to prove the promised theorem. 


Theorem 7.4.1 (Independence of the Sample Mean and Variance). For a random
sample from an $N(\mu, \sigma^2)$ distribution, the sample mean $\bar X = (1/n) \sum_{i=1}^{n} X_i$ and
the sample variance $\hat\Sigma^2 = (1/n) \sum_{i=1}^{n} (X_i - \bar X)^2$ are independent.


Proof. $\bar X$ and each $X_i - \bar X$ can be written as linear combinations of the standardizations
of the i.i.d. normal $X_i$ and have therefore a multivariate normal distribution.
By Lemma 7.4.1, $\mathrm{Cov}(X_i - \bar X, \bar X) = 0$ for all $i$ and, by Lemma 7.4.2 applied
to the $n + 1$ variables $X_i - \bar X$ and $\bar X$, we obtain that $(X_1 - \bar X, X_2 - \bar X, \ldots, X_n - \bar X)$
and $\bar X$ are independent. Hence, by an obvious extension of Theorem 4.5.7 to $n + 1$
variables, $\bar X$ is independent of $\hat\Sigma^2 = (1/n) \sum_{i=1}^{n} (X_i - \bar X)^2$. ∎


Next, we turn to finding the distribution of $\hat\Sigma^2$.

First, note that the sum $\sum_{i=1}^{n} (X_i - \mu)^2/\sigma^2$ is a chi-square random variable with
$n$ degrees of freedom. (See Definition 6.4.3.) Interestingly, the use of $\bar X$ in place of
$\mu$ in the definition of $\hat\Sigma^2$ just reduces the number of degrees of freedom by 1 and
leaves the distribution chi-square:


Theorem 7.4.2 (Distribution of the Sample Variance). For a random sample from
an $N(\mu, \sigma^2)$ distribution, the scaled sample variance

\[
\frac{n \hat\Sigma^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar X)^2
\]

is a chi-square random variable with $n - 1$ degrees of freedom.


Proof. We can write

\[
\begin{aligned}
\hat\Sigma^2 &= \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^2 = \frac{1}{n} \sum_{i=1}^{n} \left[(X_i - \mu) - (\bar X - \mu)\right]^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2 - \frac{2}{n} (\bar X - \mu) \sum_{i=1}^{n} (X_i - \mu) + (\bar X - \mu)^2,
\end{aligned} \tag{7.76}
\]

and simplifying on the right-hand side, we get

\[
\hat\Sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2 - (\bar X - \mu)^2. \tag{7.77}
\]

Multiplying both sides by $n/\sigma^2$ and rearranging result in

\[
\frac{n \hat\Sigma^2}{\sigma^2} + \left(\frac{\bar X - \mu}{\sigma/\sqrt{n}}\right)^2 = \sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2. \tag{7.78}
\]

The terms under the summation sign are the squares of independent standard normal
random variables, and so their sum is chi-square with $n$ degrees of freedom. The two
terms on the left-hand side are independent and the second term is chi-square with 1
degree of freedom. If we denote the m.g.f. of $n \hat\Sigma^2/\sigma^2$ by $\psi(t)$, then, by Theorem
5.3.2 and Example 6.4.3,

\[
\psi(t)(1 - 2t)^{-1/2} = (1 - 2t)^{-n/2} \quad \text{for } t < \frac{1}{2}. \tag{7.79}
\]

Hence

\[
\psi(t) = (1 - 2t)^{-(n-1)/2} \quad \text{for } t < \frac{1}{2}, \tag{7.80}
\]

which is the m.g.f. of a chi-square random variable with $n - 1$ degrees of freedom.
∎
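Theorems 7.4.1 and 7.4.2 can be checked empirically by simulation. The sketch below is only a plausibility check, not a proof: the sample size, parameters, seed and function names are arbitrary choices made for this illustration. It verifies that the simulated values of $n\hat\Sigma^2/\sigma^2$ have mean close to $n - 1$ (the mean of a chi-square variable with $n - 1$ degrees of freedom) and that they are nearly uncorrelated with the sample means.

```python
import random
from statistics import fmean

def simulate(n=5, mu=10.0, sigma=2.0, trials=20000, seed=0):
    """Draw normal samples; return sample means and scaled sample variances."""
    rng = random.Random(seed)
    means, scaled_vars = [], []
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        var_hat = sum((x - xbar) ** 2 for x in xs) / n   # M.L.E., divisor n
        means.append(xbar)
        scaled_vars.append(n * var_hat / sigma ** 2)
    return means, scaled_vars

def correlation(a, b):
    """Sample correlation coefficient of two equally long lists."""
    ma, mb = fmean(a), fmean(b)
    cov = fmean([(x - ma) * (y - mb) for x, y in zip(a, b)])
    va = fmean([(x - ma) ** 2 for x in a])
    vb = fmean([(y - mb) ** 2 for y in b])
    return cov / (va * vb) ** 0.5

means, scaled_vars = simulate()
print(fmean(scaled_vars), correlation(means, scaled_vars))
```

Zero correlation is of course weaker than independence; the simulation only illustrates the statements and cannot replace the m.g.f. argument above.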


Example 7.4.1 (Confidence Interval for the SD of Weights of Packages). For manufacturers
of various packages it is important to know the variability of the weight
around the nominal value. For example, assume that the weight of a 1 lb. package of
sugar is normally distributed with unknown $\sigma$. (It does not matter whether we know
$\mu$ or not.) We take a random sample of $n = 20$ such packages and observe the value
$\hat\sigma = 1.2$ oz. for the sample SD $\hat\Sigma$. Find 90% confidence limits for $\sigma$.

By Theorem 7.4.2, $20\hat\Sigma^2/\sigma^2$ has a chi-square distribution with 19 degrees of
freedom. To find 90% confidence limits for $\sigma$, we may obtain, from a table or by
computer, the fifth and the 95th percentiles of the chi-square distribution with 19
degrees of freedom, that is, look up the numbers $\chi^2_{0.05}$ and $\chi^2_{0.95}$ such that

\[
P(\chi^2_{19} \le \chi^2_{0.05}) = 0.05 \tag{7.81}
\]

and

\[
P(\chi^2_{19} \le \chi^2_{0.95}) = 0.95. \tag{7.82}
\]

We find $\chi^2_{0.05} \approx 10.12$ and $\chi^2_{0.95} \approx 30.14$. Therefore,

\[
P\left(10.12 < \frac{20 \hat\Sigma^2}{\sigma^2} < 30.14\right) \approx 0.90. \tag{7.83}
\]

For $\hat\sigma = 1.2$ the double inequality becomes

\[
10.12 < \frac{20 \cdot 1.2^2}{\sigma^2} < 30.14, \tag{7.84}
\]

which can be solved for $\sigma$ to give, approximately,

\[
0.98 < \sigma < 1.69. \tag{7.85}
\]
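Solving the double inequality 7.84 for $\sigma$ is a two-line computation. In the sketch below the chi-square percentiles are simply copied from the example (Python's standard library has no chi-square quantile function), and the function name `sd_confidence_interval` is an assumption of this illustration.

```python
def sd_confidence_interval(n, sd_hat, chi2_lo, chi2_hi):
    """Confidence limits for sigma from chi2_lo < n * sd_hat^2 / sigma^2 < chi2_hi.

    sd_hat is the observed M.L.E. of sigma (divisor n); chi2_lo and chi2_hi are
    the lower and upper percentiles of chi-square with n - 1 degrees of freedom.
    """
    total = n * sd_hat ** 2
    # Dividing by the larger percentile gives the lower limit, and vice versa.
    return (total / chi2_hi) ** 0.5, (total / chi2_lo) ** 0.5

# Values from the example: n = 20, observed SD 1.2 oz., percentiles for 19 d.f.
low, high = sd_confidence_interval(20, 1.2, 10.12, 30.14)
print(round(low, 2), round(high, 2))
```

Note that the interval is not symmetric about the observed value 1.2, because the chi-square distribution is skewed.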


As we have seen, the distribution of the statistic (n D2) / o”, used to estimate the 
variance, does not depend on jz. On the other hand, the distribution of the estimator 4 
for 2 depends on both yz and o,, and so, it is not suitable for constructing confidence 
intervals or tests for jz if o is not known. 

William S. Gosset, writing under the pseudonym Student (because his employer,
the Guinness brewing company, did not want the competition to learn that such
methods were useful in the brewery business), in 1908 introduced the statistic

$$T = \frac{\bar X - \mu}{\hat\Sigma/\sqrt{n-1}} \tag{7.86}$$

(named Student's T with $n - 1$ degrees of freedom), which is analogous to the
$Z = (\bar X - \mu)/(\sigma/\sqrt{n})$ statistic, but does not depend on $\sigma$. It is widely used for constructing
confidence intervals or tests for $\mu$ from small samples (approximately, $n < 30$)
from a normal or nearly normal population with unknown $\sigma$. For larger samples, the
central limit theorem applies and we can use $Z$ with $\hat\sigma$ in place of $\sigma$, as in Examples
7.1.7 and 7.2.3. In fact, the density of $T$ approaches the density of $Z$ as $n \to \infty$.

Next, we are going to derive the density of T in several steps. 


Theorem 7.4.3 (Density of a Ratio). If $X$ and $Y$ are independent continuous random
variables with density functions $f_X$ and $f_Y$, respectively, and $f_Y(y) = 0$ for
$y < 0$, then the density of $U = X/Y$ is given by

$$f_U(u) = \int_0^\infty y\, f_X(uy)\, f_Y(y)\, dy. \tag{7.87}$$

Proof. We have

$$F_U(u) = P\left(\frac{X}{Y} \le u\right) = \iint_{x \le uy} f_X(x) f_Y(y)\, dx\, dy
= \int_0^\infty f_Y(y) \left(\int_{-\infty}^{uy} f_X(x)\, dx\right) dy. \tag{7.88}$$

Hence, by differentiating under the first integral sign on the right-hand side, and
using the chain rule and the first part of the fundamental theorem of calculus, we
obtain

$$f_U(u) = F_U'(u) = \int_0^\infty f_Y(y) \left(\frac{\partial}{\partial u} \int_{-\infty}^{uy} f_X(x)\, dx\right) dy
= \int_0^\infty f_Y(y)\, f_X(uy)\, y\, dy. \tag{7.89}$$


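Theorem 7.4.4 (Density of $\sqrt{n}Z/\chi_n$ for Independent $Z$ and $\chi_n$). If $Z$ is standard
normal and $\chi_n$ is chi with $n$ degrees of freedom, and they are independent of each
other, then

$$U = \frac{\sqrt{n}\,Z}{\chi_n} \tag{7.90}$$

has density

$$f_U(u) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}
\left(1 + \frac{u^2}{n}\right)^{-(n+1)/2} \quad \text{for } -\infty < u < \infty. \tag{7.91}$$

Proof. Apply Theorem 7.4.3 to $U = (\sqrt{n}Z)/\chi_n$. The density of $X = \sqrt{n}Z$ is

$$f_X(x) = \frac{1}{\sqrt{2\pi n}}\, e^{-x^2/2n} \quad \text{for } -\infty < x < \infty \tag{7.92}$$

and, by Corollary 6.4.3, the density of $Y = \chi_n$ is

$$f_Y(x) = \frac{2}{2^{n/2}\,\Gamma(n/2)}\, x^{n-1} e^{-x^2/2} \quad \text{for } 0 < x < \infty. \tag{7.93}$$

Thus,

$$\begin{aligned}
f_U(u) &= \int_0^\infty y\, \frac{2}{2^{n/2}\,\Gamma(n/2)}\, y^{n-1} e^{-y^2/2}\, \frac{1}{\sqrt{2\pi n}}\, e^{-(yu)^2/2n}\, dy \\
&= \frac{2}{2^{n/2}\,\Gamma(n/2)\sqrt{2\pi n}} \int_0^\infty y^n e^{-(y^2/2)[1 + (u^2/n)]}\, dy.
\end{aligned} \tag{7.94}$$

Substituting $t = (y^2/2)[1 + (u^2/n)]$ in the last integral, we get

$$f_U(u) = \frac{\left(1 + \frac{u^2}{n}\right)^{-(n+1)/2}}{\sqrt{n\pi}\,\Gamma(n/2)} \int_0^\infty t^{(n-1)/2} e^{-t}\, dt. \tag{7.95}$$

Here, by the definition of the $\Gamma$-function (page 203), the integral equals $\Gamma((n + 1)/2)$,
yielding the desired result.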


Theorem 7.4.5 (Distribution of T). For a random sample of size $n$ from an $N(\mu, \sigma^2)$
distribution, Student's statistic

$$T = \frac{\bar X - \mu}{\hat\Sigma/\sqrt{n-1}} \tag{7.96}$$

with $n - 1$ degrees of freedom, has the same distribution as the random variable
$U = \sqrt{n-1}\,Z/\chi_{n-1}$ for independent $Z$ and $\chi_{n-1}$, and its density is

$$f_T(t) = \frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{(n-1)\pi}\,\Gamma\left(\frac{n-1}{2}\right)}
\left(1 + \frac{t^2}{n-1}\right)^{-n/2} \quad \text{for } -\infty < t < \infty. \tag{7.97}$$

Proof. By Corollary 6.2.3, $Z = (\bar X - \mu)\sqrt{n}/\sigma$ is standard normal and, by Theorem
7.4.2, $\chi_{n-1} = \hat\Sigma\sqrt{n}/\sigma$ is chi with $n - 1$ degrees of freedom. Also, they are
independent, by Theorem 7.4.1. Thus, Theorem 7.4.4 applied to these variables, with $n - 1$
in place of $n$, yields the statement of the theorem.


The density given by Equation 7.97 is called Student's t-density with $n - 1$ degrees
of freedom. The values of the corresponding distribution function are usually
obtained from tables or by computer from statistical software.

Note that $E(T)$ does not exist for 1 degree of freedom (it is the Cauchy
distribution) and, by symmetry, $E(T) = 0$ for more than 1 degree of freedom.

Also note that this density does not depend on $\mu$ and $\sigma$, and so it is suitable for
constructing confidence intervals or tests for $\mu$ if $\sigma$ is not known.
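For readers who want to experiment, the density of Equation 7.97 is easy to evaluate numerically. The following Python sketch is an illustration only: the names `t_density` and `normal_density` are our own, and the density is parametrized by the number of degrees of freedom `df` (that is, `df` plays the role of $n - 1$).

```python
import math

def t_density(t, df):
    """Student's t-density with df degrees of freedom (Equation 7.97 with df = n - 1)."""
    # lgamma avoids overflow of the Gamma function for large df.
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(t * t / df))

def normal_density(t):
    """Standard normal density, for comparison."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

# With 1 degree of freedom the density is the Cauchy density 1 / (pi * (1 + t^2)).
cauchy_at_zero = t_density(0.0, 1)

# The gap to the normal density shrinks as the degrees of freedom grow.
gap_small_df = abs(t_density(1.0, 5) - normal_density(1.0))
gap_large_df = abs(t_density(1.0, 500) - normal_density(1.0))
print(cauchy_at_zero, gap_small_df, gap_large_df)
```

The two gaps give a numerical hint of the convergence of the t-density to the standard normal density; they are, of course, not a proof.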

As mentioned at the end of Example 7.1.3, most statisticians use

$$\hat V = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)^2 \tag{7.98}$$

as an estimator of the unknown variance of a normal population, instead of $\hat\Sigma^2$. Using
the corresponding estimator

$$\hat\Sigma^+ = \left(\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)^2\right)^{1/2} \tag{7.99}$$

for the standard deviation, we can write Student's T as

$$T = \frac{\bar X - \mu}{\hat\Sigma^+/\sqrt{n}}. \tag{7.100}$$

This way of writing $T$ brings it into closer analogy with the statistic

$$Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \tag{7.101}$$

used for estimating $\mu$ when $\sigma$ is known.


Example 7.4.2 (Confidence Interval for the Mean Weight of Packages). As in Example
7.4.1, assume that the weight of a 1 lb. package of sugar is normally distributed
with unknown $\sigma$ and consider a random sample of $n = 20$ such packages and observe
the values $\bar x = 16.1$ oz. and $\hat\sigma = 1.2$ oz. for the sample mean $\bar X$ and SD $\hat\Sigma$. Find
90% confidence limits for $\mu$.

By Theorem 7.4.5, $T = (\bar X - \mu)\sqrt{n-1}/\hat\Sigma$ has the t-distribution with 19 degrees
of freedom in this case. Thus, we need to determine two numbers $t_1$ and $t_2$ such that
$P(t_1 < T < t_2) = 0.90$ for this distribution. It is customary to choose $t_1 = -t_2 = -t$.
Then, by the left-right symmetry of the t-distributions, we want to find $t$ such
that $P(T < -t) = 0.05$. From a t-table we obtain $t \approx 1.7291$, and so

$$P\left(-1.7291 < \frac{\bar X - \mu}{\hat\Sigma/\sqrt{19}} < 1.7291\right) = 0.90, \tag{7.102}$$

or, equivalently,

$$P\left(\bar X - 1.7291\,\frac{\hat\Sigma}{\sqrt{19}} < \mu < \bar X + 1.7291\,\frac{\hat\Sigma}{\sqrt{19}}\right) = 0.90. \tag{7.103}$$

Substituting the observed values $\bar x = 16.1$ and $\hat\sigma = 1.2$ for $\bar X$ and $\hat\Sigma$, we get

$$15.624 < \mu < 16.576 \tag{7.104}$$

as a 90% confidence interval for $\mu$.
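The substitution in the last step of Example 7.4.2 can be checked with a short Python sketch. The function name `mean_confidence_interval` is hypothetical, and the critical value 1.7291 is the t-table value quoted in the example rather than something the code computes.

```python
import math

def mean_confidence_interval(n, sample_mean, sample_sd, t_critical):
    """Limits sample_mean -/+ t_critical * sample_sd / sqrt(n - 1).

    sample_sd is the sample SD computed with divisor n, which is why the
    denominator is sqrt(n - 1) rather than sqrt(n).
    """
    half_width = t_critical * sample_sd / math.sqrt(n - 1)
    return sample_mean - half_width, sample_mean + half_width

low, high = mean_confidence_interval(n=20, sample_mean=16.1, sample_sd=1.2, t_critical=1.7291)
print(f"{low:.3f} < mu < {high:.3f}")
```

If the SD were computed with divisor $n - 1$ instead, as in Equation 7.99, the same limits would result from dividing by $\sqrt{n}$.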


Example 7.4.3 (Small Sample Test for Weight Reduction). As in Example 7.2.2, we
want to test the effectiveness of a new drug for weight reduction and administer it,
this time, to a random sample of just 10 adult women for a month. We assume that
the weight losses (as positive values), from the beginning to the end of the month, of
these women are i.i.d. normal. Let $\bar X$ denote the average weight loss and $\hat\Sigma$
the sample SD of the weight losses. Suppose we observe $\bar x = 1.5$ lbs. and $\hat\sigma = 4$
lbs. We estimate $\sigma$ by the observed value $\hat\sigma = 4$ lbs. of $\hat\Sigma$. The mean weight loss $\mu$ is
unknown and we want to find the extent to which the observed $\bar X$ value supports the
null hypothesis $\mu = 0$, that is, to find the P-value of the observed average weight
reduction.

Since the sample is small, $\sigma$ is unknown, and the population is normal, we may
use the T-statistic with mean $\mu_0 = 0$ and with 9 degrees of freedom for our test.
Since $H_A$ is of the form $\mu > \mu_0$, we take the rejection region to be of the form
$[1.5, \infty)$, that is, we reject $H_0$ if $\bar X \ge 1.5$ or, equivalently, if $T \ge t$, where

$$t = \frac{\bar x - \mu_0}{\hat\sigma/\sqrt{n-1}} = \frac{1.5 - 0}{4/\sqrt{9}} = 1.125. \tag{7.105}$$

From a t-table,

$$P(T \ge 1.125) \approx 0.145. \tag{7.106}$$

This P-value is fairly high, which means that the probability of an erroneous
rejection of the null hypothesis would be high or, in other words, our observed result
can well be explained by the null hypothesis: the weight reduction is not statistically
significant.
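Instead of a t-table, the P-value of Example 7.4.3 can be approximated by integrating the t-density numerically. The sketch below is one possible way to do this, not the method of any particular statistics package: it uses Simpson's rule with an arbitrarily chosen number of steps, and the function names are our own.

```python
import math

def t_density(t, df):
    """Student's t-density with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(t * t / df))

def t_upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0: one half minus Simpson's-rule integral of the density over [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    return 0.5 - total * h / 3

n, mean_loss, sample_sd, mu_0 = 10, 1.5, 4.0, 0.0
t_observed = (mean_loss - mu_0) / (sample_sd / math.sqrt(n - 1))
p_value = t_upper_tail(t_observed, df=n - 1)
print(t_observed, round(p_value, 3))
```

The symmetry of the density is what allows us to integrate only over $[0, t]$ and subtract the result from $1/2$.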


The test of the preceding example is called the t-test or Student's t-test and is
used for hypotheses involving the mean $\mu$ of a normal population when the sample
is small ($n < 30$) and the SD is unknown and is estimated from the sample.

Exercises


Exercise 7.4.1. The lifetimes of five light bulbs of a certain type are measured and 
are found to be 850, 920, 945, 1008 and 1022 hours, respectively. Assuming that the 
lifetimes are normally distributed, find 95% confidence intervals for $\mu$ and $\sigma$.


Exercise 7.4.2. In high precision measurements, repeated results usually vary due 
to uncontrollable and unknown factors. Scientists generally adopt the Gauss model 
for such measurements, according to which the measured data are like samples from 
a normally distributed population. Suppose a grain of salt is measured three times 
and is found to weigh 254, 276, 229 micrograms, respectively. Assuming the Gauss 
model, with $\mu$ being the true weight and $\sigma$ unknown, find a 95% confidence interval
for $\mu$, centered at $\bar x$.


Exercise 7.4.3. At the service counter of a department store a sign says that the av- 
erage service time is 2.5 minutes. To test this claim, 5 customers were observed and 
their service times turned out to be 140, 166, 177, 132, and 189 seconds, respectively. 
Assuming a normal distribution for the service times, test $H_0$: $\mu = 150$ sec. against
$H_A$: $\mu > 150$ sec. Find the P-value and draw a conclusion whether the store's claim
is acceptable or not. 


Exercise 7.4.4. A new car model is claimed to run at 40 miles/gallon on the highway. 
Five such cars were tested and the following fuel efficiencies were found: 42, 36, 39, 
41, 37 miles/gallon. Assuming a normal distribution for the fuel efficiencies, test $H_0$:
$\mu = 40$ against $H_A$: $\mu < 40$. Find the P-value and draw a conclusion whether the
claim is acceptable or not. 


Exercise 7.4.5. Prove that the density $f_T(t)$ of $T$, given by Equation 7.97, tends to
the standard normal density $\varphi(t)$ as $n \to \infty$.


Exercise 7.4.6. In Example 7.2.8, we cited a study of twins (Footnote 6), in which 
the following mean bone densities of the lumbar spine of 20 twin pairs were also 
measured for twins of each pair differing by 20 or more pack-years of smoking 
(rather than just 5 pack-years, as discussed earlier): 


          | Lighter smoker (g/cm²) | Heavier smoker (g/cm²) | Difference
Mean ± SE | 0.794 ± 0.032          | 0.726 ± 0.032          | 0.068 ± 0.020


Assuming normally distributed data, do a t-test for the effect of smoking on bone 
density. 


Exercise 7.4.7. Prove that for $T$ with $n > 2$ degrees of freedom $\mathrm{Var}(T) = n/(n - 2)$.
Hint: Use the fact that the $U = (\sqrt{n}Z)/\chi_n$ in Theorem 7.4.4 has a t-distribution with
$n$ degrees of freedom and $Z$ and $\chi_n$ are independent.

7.5 Chi-Square Tests


In various applications, where a statistical experiment may result in several, not just 
two, possible outcomes, chi-square distributions provide tests for proving or disprov- 
ing an underlying theoretical prediction of the observations. 


Example 7.5.1 (Pea Color). In 1865, an Austrian monk, Gregor Mendel, published 
a revolutionary scientific article in which he proposed a theory for the inheritance 
of certain characteristics of pea plants, on the basis of what we now call genes and 
he called entities. This was truly remarkable, because he arrived at his theory by 
cross-breeding experiments, without ever being able to see genes under a micro- 
scope. Among other things, he found that when he crossed purebred yellow-seeded 
with purebred green-seeded plants, then all the hybrid seeds turned out yellow, but 
when he crossed these hybrids with each other, then about 75% of the seeds turned 
out yellow and 25% green. 

He explained this observation as follows: there are two variants (alleles) of a gene 
that determine seed color: say, g and y. Each seed contains two of these variants, and 
the seeds containing gy, yg and yy are yellow and those containing gg are green. 
(We call y dominant and g recessive.) Every ordinary cell of a plant contains the 
same pair as the seed from which it grew, but the sex cells (sperm and egg) get only 
one of these genes, by splitting the pair of ordinary cells. The purebred parents have 
gene-pairs gg and yy, and so their sex cells have g and y, respectively. (Purebred 
plants with only yellow seeds can be produced by crossing yellows with each other 
over several generations, until no greens are produced.) Thus, the first generation 
hybrids all get a g from one parent and a y from the other, resulting in type gy or yg. 
(Actually, these two types are the same; we just need to distinguish them from each 
other for the purpose of computing probabilities, as we did for two coins.) The seeds 
of these hybrids all look yellow. 

In the next generation, when crossing first generation hybrids with each other, 
each parent may contribute a g or a y to each sex cell with equal probability, and 
when those mate at random, we get all four possible pairs with equal probability. 
Since three pairs gy, yg and yy look yellow and only one, gg, looks green,
$p = P(\text{yellow}) = 3/4 = p_0$ and $q = P(\text{green}) = 1/4 = q_0$.

Suppose that to test the theory, we grow $n = 1000$ second generation hybrid
seeds and obtain $n_1 = 775$ yellow and $n_2 = 225$ green seeds. We take $H_0$: $p = p_0$
and $q = q_0$, and $H_A$: $p \ne p_0$ and $q \ne q_0$. Karl Pearson in 1900 suggested using the
following statistic for such problems:

$$K^2 = \frac{(N_1 - np_0)^2}{np_0} + \frac{(N_2 - nq_0)^2}{nq_0}, \tag{7.107}$$

where $N_1$ and $N_2$ are the random variables whose observed values are $n_1$ and $n_2$.
The reason for choosing this form is that $(N_1 - np_0)^2$ and $(N_2 - nq_0)^2$ measure
the magnitude of the deviations of the actual from the expected values, and we should
consider only their sizes relative to the expected values. A large value of $(N_1 - np_0)^2$
indicates a relatively bigger discrepancy from the expectation when $np_0$ is small
than when it is large. The fractions take care of this consideration. (The fact that the
numerators are squared but the denominators are not, may seem strange, but it makes
the mathematics come out right.)

This statistic is especially useful when there are more than two possible outcomes.
In the present case, we could just use the Z-test for $p$ (assuming large $n$),
since $q$ is determined by $p$. (See Exercise 7.5.1.) We may, however, use $K^2$ as well.
Substituting $q_0 = 1 - p_0$ and $N_2 = n - N_1$ in Equation 7.107, we obtain

$$K^2 = (N_1 - np_0)^2 \left(\frac{1}{np_0} + \frac{1}{nq_0}\right) \tag{7.108}$$

or, equivalently,

$$K^2 = \frac{(N_1 - np_0)^2}{np_0q_0}. \tag{7.109}$$

By the CLT, the distribution of $(N_1 - np_0)/\sqrt{np_0q_0}$ tends to the standard normal
as $n \to \infty$, and so the distribution of $K^2$ tends to the chi-square distribution with
one degree of freedom. Thus, using the chi-square table and the given values of
$n, n_1, p_0, q_0$, we get the large-sample approximation of the P-value of the test as

$$P\left(K^2 > \frac{(n_1 - np_0)^2}{np_0q_0}\right) = P\left(K^2 > \frac{(775 - 750)^2}{1000 \cdot \frac{3}{4} \cdot \frac{1}{4}}\right)
\approx P(\chi_1^2 > 3.33) \approx 0.068. \tag{7.110}$$

Hence, the null hypothesis can be accepted.
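The reduction of the two-term statistic of Equation 7.107 to the one-term form of Equation 7.109 can also be checked numerically on the pea data. The following Python sketch does so; the variable names are our own, and the tail probability uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable.

```python
import math

n, n_yellow, n_green = 1000, 775, 225
p0, q0 = 3 / 4, 1 / 4

# Two-term form (Equation 7.107) and one-term form (Equation 7.109) of the statistic.
two_terms = (n_yellow - n * p0) ** 2 / (n * p0) + (n_green - n * q0) ** 2 / (n * q0)
one_term = (n_yellow - n * p0) ** 2 / (n * p0 * q0)

# With one degree of freedom, P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(one_term / 2))
print(round(two_terms, 2), round(one_term, 2), round(p_value, 3))
```

The identity $P(\chi_1^2 > x) = P(|Z| > \sqrt{x})$ used in the last step is also the reason why the Z-test for $p$ leads to the same P-value.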


Observe that, because of the relation $N_1 + N_2 = n$ together with $p_0 + q_0 = 1$,
the sum of two dependent square terms in Equation 7.107 reduces to just one
such term in Equation 7.109. In other words, only one of $N_1$ or $N_2$ is free to vary.
Similarly, if there are $k > 2$ possible outcomes, the relation expressing the fact that
the sample size is $n$, and correspondingly the sum of the probabilities is 1, produces
$k - 1$ independent square terms in the generalization of Equation 7.107. In fact, if we
also use the data to estimate $r$ parameters of the given distribution $p_{01}, p_{02}, \ldots, p_{0k}$
(examples of this will follow), then the number of independent terms turns out to
be $k - 1 - r$, which is also the number of degrees of freedom for the limiting
chi-square random variable. Thus, the number of degrees of freedom equals the number
of independent random variables $N_i$ that are free to vary. Hence the name "degrees
of freedom."

We summarize all this as follows: 


Definition 7.5.1 (Chi-Square Test for a Finite Distribution).

Suppose we consider an experiment with $k \ge 2$ possible outcomes, with unknown
probabilities $p_1, p_2, \ldots, p_k$, and we want to decide between two hypotheses
$H_0$: $p_i = p_{0i}$ for all $i = 1, 2, \ldots, k$ and $H_A$: $p_i \ne p_{0i}$ for some $i = 1, 2, \ldots, k$,
where $p_{01}, p_{02}, \ldots, p_{0k}$ are given.

We consider $n$ independent repetitions of the experiment with the random variables
$N_i$ denoting the number of times the $i$th outcome occurs, for $i = 1, 2, \ldots, k$,
where $\sum_{i=1}^k N_i = n$. We use the test statistic

$$K^2 = \sum_{i=1}^k \frac{(N_i - np_{0i})^2}{np_{0i}}. \tag{7.111}$$

It can be proved that the distribution of $K^2$ tends to the chi-square distribution with
$k - 1$ degrees of freedom. Furthermore, if we also use the data to estimate $r$ parameters
of the given distribution $p_{01}, p_{02}, \ldots, p_{0k}$, then the distribution of $K^2$ tends to
the chi-square distribution with $k - 1 - r$ degrees of freedom. Thus, we obtain the
P-value of the test approximately,⁷ for large $n$, by using the chi-square table for
$P = P(\chi^2 > \hat\chi^2)$, where

$$\hat\chi^2 = \sum_{i=1}^k \frac{(n_i - np_{0i})^2}{np_{0i}} \tag{7.112}$$

is the observed value of $K^2$. If $P$ is less than 0.05, we say that the evidence for $H_A$ is
significant, and if it is less than 0.01, highly significant, and we accept $H_A$; otherwise we
accept $H_0$. In particular, a small value of $\hat\chi^2$ that leads to a large P-value is strong
evidence in favor of $H_0$ (provided that the data are really from a random sample).


It may be helpful to remember Formula 7.112 in words as 


$$\hat\chi^2 = \sum_{\text{all categories}} \frac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}}, \tag{7.113}$$


where the expected frequencies are based on Ho. 
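When no chi-square table is at hand, both the statistic of Formula 7.113 and the tail probability can be computed directly. The Python sketch below is a minimal illustration with function names of our own choosing. For the tail probability it assumes an integer number of degrees of freedom and relies on the standard recurrence for the incomplete gamma function, starting from the closed forms for 1 and 2 degrees of freedom; this is a swapped-in numerical technique, not something derived in the text.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories (Formula 7.113)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x), for integer df >= 1 and x >= 0.

    Starts from the closed form for df = 1 (erfc) or df = 2 (exponential) and
    adds one term for every further two degrees of freedom.
    """
    half = x / 2
    if df % 2:
        tail, shape = math.erfc(math.sqrt(half)), 0.5
    else:
        tail, shape = math.exp(-half), 1.0
    while shape < df / 2:
        tail += half ** shape * math.exp(-half) / math.gamma(shape + 1)
        shape += 1
    return tail

# Pea data of Example 7.5.1: observed counts and counts expected under H0.
statistic = chi_square_statistic([775, 225], [750, 250])
p_value = chi_square_upper_tail(statistic, df=1)
print(round(statistic, 2), round(p_value, 3))
```

For two degrees of freedom the tail is simply $e^{-x/2}$, and for four it is $e^{-x/2}(1 + x/2)$, which gives a quick way to check such a routine by hand.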


Example 7.5.2 (Are Murders Poisson Distributed?). In a certain state, the following
table shows the number $n_i$ of weeks in three years with $i$ murders. Can we model
these numbers with a Poisson distribution?

i   || 0 | 1  | 2  | 3  | 4  | 5  | 6  | 7 | 8
n_i || 4 | 12 | 23 | 34 | 33 | 23 | 16 | 8 | 3


First, we have to determine the parameter of the Poisson distribution with the
best fit to these data. Since for a Poisson random variable $\lambda = E(X)$, we should use
$\bar x$, that is, the average number of murders per week, as the estimate of $\lambda$. Thus we
choose

$$\hat\lambda = \frac{1}{156} \sum_{i=0}^{8} i\,n_i
= \frac{1}{156}(0 \cdot 4 + 1 \cdot 12 + 2 \cdot 23 + 3 \cdot 34 + 4 \cdot 33 + 5 \cdot 23 + 6 \cdot 16 + 7 \cdot 8 + 8 \cdot 3)
\approx 3.7372 \tag{7.114}$$

and so,

$$p_{0i} = \frac{\hat\lambda^i e^{-\hat\lambda}}{i!} \quad \text{for } i = 0, 1, 2, \ldots. \tag{7.115}$$

⁷ A rough rule of thumb is that $n$ should be large enough so that $np_{0i} \ge 10$ for each $i$,
although some authors go as low as 5 instead of 10 and, for large $k$, even allow a few $np_{0i}$
to be close to 1.

Some of the numbers $np_{0i}$ are less than 10, and so, to safely use the chi-square
approximation, we lump those together into two categories and tabulate the expected
frequencies as follows:

i       || 0, 1   | 2     | 3      | 4      | 5      | 6      | 7, 8, ...
np_{0i} || 17.603 | 25.95 | 32.327 | 30.203 | 22.575 | 14.061 | 13.279

Since we now have $k = 7$ categories and $r = 1$ estimated parameter, we use
chi-square with 5 degrees of freedom. Thus,

$$\begin{aligned}
\hat\chi^2 &= \frac{(16 - 17.603)^2}{17.603} + \frac{(23 - 25.95)^2}{25.95} + \frac{(34 - 32.327)^2}{32.327} + \frac{(33 - 30.203)^2}{30.203} \\
&\quad + \frac{(23 - 22.575)^2}{22.575} + \frac{(16 - 14.061)^2}{14.061} + \frac{(11 - 13.279)^2}{13.279} \\
&= 1.4935
\end{aligned} \tag{7.116}$$

and from a table, the corresponding P-value is $P(\chi_5^2 > \hat\chi^2) \approx 0.91$. Thus, we have
very strong evidence for accepting $H_0$, that is, that the data came from a Poisson
distribution with $\lambda = 3.7372$, except that there may be some distortion within the
lumped categories.
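The whole computation of Example 7.5.2 (estimating $\lambda$, lumping the sparse categories, and evaluating the statistic) can be retraced in Python as follows. This is a sketch under the same lumping as in the example; the variable names are ours, and the tail probability for 5 degrees of freedom is written out from the closed form for odd degrees of freedom rather than read from a table. Small differences from the printed values may arise because the code does not round $\hat\lambda$.

```python
import math

weeks = [4, 12, 23, 34, 33, 23, 16, 8, 3]        # n_i for i = 0, 1, ..., 8 murders
n = sum(weeks)                                   # 156 weeks in three years
rate = sum(i * n_i for i, n_i in enumerate(weeks)) / n   # estimate of lambda

def poisson_pmf(i, lam):
    return lam ** i * math.exp(-lam) / math.factorial(i)

# Lumped categories: {0, 1}, 2, 3, 4, 5, 6, and {7, 8, ...}.
observed = [weeks[0] + weeks[1]] + weeks[2:7] + [weeks[7] + weeks[8]]
probs = [poisson_pmf(0, rate) + poisson_pmf(1, rate)]
probs += [poisson_pmf(i, rate) for i in range(2, 7)]
probs.append(1 - sum(probs))                     # everything from 7 upward
expected = [n * p for p in probs]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Upper tail of chi-square with 5 degrees of freedom (7 categories, 1 estimated parameter).
half = statistic / 2
p_value = math.erfc(math.sqrt(half)) + math.exp(-half) * (
    half ** 0.5 / math.gamma(1.5) + half ** 1.5 / math.gamma(2.5))
print(round(rate, 4), [round(e, 3) for e in expected], round(statistic, 4), round(p_value, 2))
```

Note that the last category is assigned the whole remaining probability $P(X \ge 7)$, so that the expected frequencies add up to $n$.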


In the example above, we tested whether the data represented a random sample 
from a Poisson distributed population. In general, a test for deciding whether the 
distribution of a population is a specified one, is called a test for goodness of fit. 
For discrete distributions and large n, the chi-square test can be used for this pur- 
pose as above. For continuous distributions, we can reduce the problem to a discrete 
one by partitioning the domain into a finite number of intervals and approximating 
the continuous distribution by the discrete distribution given by the probabilities of 
the intervals. Although this approximation may mask some features of the original 
distribution, it is still widely used in many applications as in the following example. 


Example 7.5.3 (Grades). The grades assigned by a certain professor in several
calculus classes were distributed according to the following table:

Points    || (85,100] | (70,85] | (55,70] | (40,55] | [0,40]
Grade     || A        | B       | C       | D       | F
Frequency || 45       | 56      | 157     | 83      | 52


Do these grades represent a random sample from an underlying normal distribu- 
tion? 

To answer this question, first we have to estimate $\mu$ and $\sigma$ of the best fitting
normal distribution, and then we may use a chi-square test as follows. First, $n = 393$,
and we estimate $\mu$ and $\sigma$, by using the midpoints of the class-intervals, as

$$\bar x = \frac{1}{393}(92.5 \cdot 45 + 77.5 \cdot 56 + 62.5 \cdot 157 + 47.5 \cdot 83 + 20 \cdot 52) \approx 59.28 \tag{7.117}$$

and

$$\begin{aligned}
\hat\sigma &= \frac{1}{\sqrt{393}} \Big[(92.5 - 59.28)^2 \cdot 45 + (77.5 - 59.28)^2 \cdot 56 + (62.5 - 59.28)^2 \cdot 157 \\
&\qquad + (47.5 - 59.28)^2 \cdot 83 + (20 - 59.28)^2 \cdot 52\Big]^{1/2} \\
&\approx 20.28.
\end{aligned} \tag{7.118}$$


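Using the normal distribution with these parameters, we get the probabilities $p_{0i}$
for the class-intervals as

$$P((85, 100]) = \Phi\left(\frac{100 - 59.28}{20.28}\right) - \Phi\left(\frac{85 - 59.28}{20.28}\right) \approx 0.10, \tag{7.119}$$

$$P((70, 85]) = \Phi\left(\frac{85 - 59.28}{20.28}\right) - \Phi\left(\frac{70 - 59.28}{20.28}\right) \approx 0.20, \tag{7.120}$$

$$P((55, 70]) = \Phi\left(\frac{70 - 59.28}{20.28}\right) - \Phi\left(\frac{55 - 59.28}{20.28}\right) \approx 0.29, \tag{7.121}$$

$$P((40, 55]) = \Phi\left(\frac{55 - 59.28}{20.28}\right) - \Phi\left(\frac{40 - 59.28}{20.28}\right) \approx 0.25, \tag{7.122}$$

$$P([0, 40]) = \Phi\left(\frac{40 - 59.28}{20.28}\right) - \Phi\left(\frac{0 - 59.28}{20.28}\right) \approx 0.16, \tag{7.123}$$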

and the expected numbers $np_{0i}$ as

Points  || (85,100] | (70,85] | (55,70] | (40,55] | [0,40]
np_{0i} || 39.3     | 78.6    | 114     | 98.3    | 63

Thus,

$$\hat\chi^2 = \frac{(45 - 39.3)^2}{39.3} + \frac{(56 - 78.6)^2}{78.6} + \frac{(157 - 114)^2}{114} + \frac{(83 - 98.3)^2}{98.3} + \frac{(52 - 63)^2}{63}
\approx 27.8. \tag{7.124}$$

The number of degrees of freedom is $5 - 1 - 2 = 2$, since we had five categories and
estimated two parameters from the data. Hence, a chi-square probability computation
gives the P-value $P(\chi_2^2 > \hat\chi^2) \approx 10^{-6}$. Thus, we reject the null hypothesis, that the
distribution is normal, with a very high degree of confidence.


The chi-square test can also be used for testing independence of two distributions. 
We illustrate how by an example, first. 


Example 7.5.4 (Age and Party of Voters). We take a random sample of 500 voters 
in a certain town and want to determine whether the age and party affiliation cate- 
gories, as discussed in Examples 3.3.3 and 3.3.6, are independent of each other in 
the population. Thus, we take $H_0$ to be the hypothesis that each age category is
independent of each party category, and $H_A$ that they are not independent. We shall give
a quantitative formulation of these hypotheses below.

Suppose the sample yields the following observed frequency table for this two- 
way classification, also called a contingency table: 


Age\Party | Republican | Democrat | Independent || Any affiliation
Under 30  | 41         | 52       | 60          || 153
30 to 50  | 55         | 64       | 60          || 179
Over 50   | 48         | 53       | 67          || 168
Any age   | 144        | 169      | 187         || 500


First, we convert this table to a table of relative frequencies, by dividing each 
entry by 500: 


Age\Party | Republican | Democrat | Independent || Any affiliation
Under 30  | 0.082      | 0.104    | 0.120       || 0.306
30 to 50  | 0.110      | 0.128    | 0.120       || 0.358
Over 50   | 0.096      | 0.106    | 0.134       || 0.336
Any age   | 0.288      | 0.338    | 0.374       || 1.000

These numbers represent the probabilities that, given the sample, a randomly
chosen one of the 500 persons would fall in the appropriate category. Now, under
the assumption of independence, the joint probabilities would be the products of
the marginal probabilities. For instance, we would have $P(\text{Under 30} \cap \text{Republican})
= 0.306 \cdot 0.288 = 0.088128$. We show these products in the next table:

Age\Party | Republican | Democrat | Independent || Any affiliation
Under 30  | 0.088128   | 0.103428 | 0.114444    || 0.306
30 to 50  | 0.103104   | 0.121004 | 0.133892    || 0.358
Over 50   | 0.096768   | 0.113568 | 0.125664    || 0.336
Any age   | 0.288      | 0.338    | 0.374       || 1.000


Thus, the promised quantitative expression of Ho is the assumption that the joint 
probabilities in the population (not in the sample) of the cross classification are the 
nine joint probabilities in the table above. 


Hence, the expected frequencies under $H_0$ are 500 times these probabilities, as
given below:

Age\Party | Republican | Democrat | Independent || Any affiliation
Under 30  | 44.064     | 51.714   | 57.222      || 153
30 to 50  | 51.552     | 60.502   | 66.946      || 179
Over 50   | 48.384     | 56.784   | 62.832      || 168
Any age   | 144        | 169      | 187         || 500

Consequently,

$$\begin{aligned}
\hat\chi^2 &= \frac{(41 - 44.064)^2}{44.064} + \frac{(52 - 51.714)^2}{51.714} + \frac{(60 - 57.222)^2}{57.222} \\
&\quad + \frac{(55 - 51.552)^2}{51.552} + \frac{(64 - 60.502)^2}{60.502} + \frac{(60 - 66.946)^2}{66.946} \\
&\quad + \frac{(48 - 48.384)^2}{48.384} + \frac{(53 - 56.784)^2}{56.784} + \frac{(67 - 62.832)^2}{62.832} \approx 2.035.
\end{aligned} \tag{7.125}$$

The number of degrees of freedom is $9 - 4 - 1 = 4$, because the number of terms is
$k = 9$, and the marginal probabilities may be regarded as parameters estimated from
the data and $r = 4$ of them determine all six (any two of the row-sums determine the
third one, and the same is true for the column-sums). Hence, a chi-square probability
computation gives the P-value $P(\chi_4^2 > \hat\chi^2) \approx 0.73$, which suggests the acceptance
of the independence hypothesis.⁸
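The steps of Example 7.5.4 translate almost literally into code. The Python sketch below is one way to retrace them from the observed contingency table; the variable names are our own, and the closed form used for the tail probability is valid only for 4 degrees of freedom, which is the case here.

```python
import math

observed = [
    [41, 52, 60],   # under 30: Republican, Democrat, Independent
    [55, 64, 60],   # 30 to 50
    [48, 53, 67],   # over 50
]
n = sum(map(sum, observed))
row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]

# Under independence, expected count = n * (row_sum / n) * (col_sum / n).
expected = [[r * c / n for c in col_sums] for r in row_sums]

statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_sums) - 1) * (len(col_sums) - 1)

# For 4 degrees of freedom the upper tail has the closed form exp(-x/2) * (1 + x/2).
p_value = math.exp(-statistic / 2) * (1 + statistic / 2)
print(expected[0], round(statistic, 3), df, round(p_value, 2))
```

Multiplying the row and column sums and dividing by $n$ gives the expected frequencies in one step, without passing through the tables of relative frequencies.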


The method of the example above can be generalized to arbitrary two-way clas- 
sifications: 


⁸ The result could be explained by nonindependent distributions as well, but a computation
of type 2 errors would be hopeless because of the various ways nonindependence can occur.

Theorem 7.5.1 (Chi-Square Test for Independence from Contingency Tables).
Suppose we want to test the independence of two kinds of categories in a population,
with $a$ categories of the first kind and $b$ of the second. We take a random sample of
size $n$ and construct a size $a \times b$ contingency table from the observed $k = ab$ joint
frequencies $n_{ij}$. We convert this table to a table of relative frequencies $r_{ij} = n_{ij}/n$
and compute the row and column sums $r_i = \sum_{j=1}^b r_{ij}$ and $s_j = \sum_{i=1}^a r_{ij}$. We
define the $k$ joint probabilities $p_{0,ij} = r_i s_j$ and take $H_0$ to be the hypothesis that the
joint probabilities in the population satisfy $p_{ij} = p_{0,ij}$ for all $i = 1, 2, \ldots, a$ and
$j = 1, 2, \ldots, b$.

We define

$$K^2 = \sum_{i=1}^a \sum_{j=1}^b \frac{(N_{ij} - np_{0,ij})^2}{np_{0,ij}}, \tag{7.126}$$

where the $N_{ij}$ are the random variables whose observed values are the $n_{ij}$. The $r_i$
and $s_j$ are parameters estimated from the data, but since $\sum r_i = 1$ and $\sum s_j = 1$,
we need to estimate only $r = a + b - 2$ parameters. Thus, the distribution of $K^2$ tends
to the chi-square distribution with $k - 1 - r = (a - 1)(b - 1)$ degrees of freedom.
We obtain the P-value of the test approximately, for large $n$, by using a chi-square
table for $P = P(\chi^2 > \hat\chi^2)$, where

$$\hat\chi^2 = \sum_{i=1}^a \sum_{j=1}^b \frac{(n_{ij} - np_{0,ij})^2}{np_{0,ij}} \tag{7.127}$$

is the observed value of $K^2$.


There exists still another use of chi-square, which is very similar to the one above: 
testing contingency tables for homogeneity. In such problems, we have several sub- 
populations and a sample of prescribed size from each, and we want to test whether 
the probability distribution over a set of categories is the same in each subpopula- 
tion. If it is, then we call the population homogeneous over the subpopulations with 
respect to the distribution over the given categories. For example, we could modify 
Example 7.5.4 by taking the three age-groups as the subpopulations, deciding how 
many we wish to sample from each group, and testing whether the distribution of 
party affiliation is the same in each age group, that is, whether the population is ho- 
mogeneous over age with respect to party affiliation. We do such a modification of 
Example 7.5.4 next: 


Example 7.5.5 (Testing Homogeneity of Party Distribution Over Age Groups). Suppose
we decide to sample 150 voters under 30, 200 voters between 30 and 50, and
250 voters over 50, and want to test whether the distribution of party affiliation is the
same in each age group of the population. We observe the following sample data:

Age\Party | Republican | Democrat | Independent || Any affiliation
Under 30  | 41         | 55       | 54          || 150
30 to 50  | 52         | 66       | 82          || 200
Over 50   | 61         | 83       | 106         || 250
Any age   | 154        | 204      | 242         || 600


Under $H_0$, which is the assumption of homogeneity, the most likely probability
distribution of party affiliation can be obtained by dividing each column sum by 600,
and then the expected frequencies can be computed by multiplying these fractions by
each row sum. Thus, for instance, $P(\text{Republican}) = 154/600$ and
$E(n(\text{Under 30} \cap \text{Republican})) = (154/600) \cdot 150 = 38.5$. As this calculation shows,
under the present $H_0$, the expected frequencies are computed exactly as in the test for
independence, and we find them as

Age\Party | Republican | Democrat | Independent || Any affiliation
Under 30  | 38.50      | 51.00    | 60.50       || 150
30 to 50  | 51.33      | 68.00    | 80.67       || 200
Over 50   | 64.17      | 85.00    | 100.83      || 250
Any age   | 154        | 204      | 242         || 600
Thus,

$$\begin{aligned}
\hat\chi^2 &= \frac{(41 - 38.50)^2}{38.50} + \frac{(55 - 51)^2}{51} + \frac{(54 - 60.5)^2}{60.5} + \frac{(52 - 51.33)^2}{51.33} \\
&\quad + \frac{(66 - 68)^2}{68} + \frac{(82 - 80.67)^2}{80.67} + \frac{(61 - 64.17)^2}{64.17} + \frac{(83 - 85)^2}{85} \\
&\quad + \frac{(106 - 100.83)^2}{100.83} \approx 1.73.
\end{aligned} \tag{7.128}$$

The number of independent data is $k = 6$, because in each row the three frequencies
must add up to the given row-sum and so only two are free to vary. We estimated
$r = 2$ independent column sums as parameters. Thus, the number of degrees of freedom
is $k - r = 4$, the same as in Example 7.5.4. (We do not need to subtract 1,
because the fact that the sum of the joint frequencies is 600 has already been used
in eliminating a column-sum.) Hence, a chi-square probability computation gives the
P-value $P(\chi_4^2 > \hat\chi^2) \approx 0.78$. This is strong evidence for accepting the hypothesis of
homogeneity.
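The homogeneity computation can be phrased in code exactly as in the description of Example 7.5.5: pool the party distribution over all age groups and scale it by each prescribed sample size. The Python sketch below does this with variable names of our own; it takes the Under-30 Democrat count to be 55, the value that appears in Equation 7.128 and is consistent with the row sum of 150.

```python
import math

sample_sizes = {"Under 30": 150, "30 to 50": 200, "Over 50": 250}   # fixed in advance
observed = {
    "Under 30": [41, 55, 54],     # Republican, Democrat, Independent
    "30 to 50": [52, 66, 82],
    "Over 50": [61, 83, 106],
}
n = sum(sample_sizes.values())

# Pooled party distribution under homogeneity: column sums divided by n.
pooled = [sum(col) / n for col in zip(*observed.values())]

# Expected counts: the pooled distribution scaled by each prescribed row sum.
expected = {group: [share * size for share in pooled]
            for group, size in sample_sizes.items()}

statistic = sum(
    (o - e) ** 2 / e
    for group in observed
    for o, e in zip(observed[group], expected[group])
)
# Four degrees of freedom, so the upper tail is exp(-x/2) * (1 + x/2).
p_value = math.exp(-statistic / 2) * (1 + statistic / 2)
print({g: [round(e, 2) for e in row] for g, row in expected.items()},
      round(statistic, 2), round(p_value, 2))
```

Each expected count is again (row sum)(column sum)/$n$, which is why the arithmetic coincides with that of the test for independence even though the sampling scheme is different.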


We can generalize Example 7.5.5: 


Theorem 7.5.2 (Chi-Square Test for Homogeneity). If a population is made up
of several subpopulations and we want to test whether the distribution over certain
categories is the same in each subpopulation, then we take a sample of prescribed
size from each subpopulation and construct a contingency table from the observed
frequencies for each category in each subpopulation. We compute chi-square and
the number of degrees of freedom exactly as in the test for independence and draw
conclusions in the same way.


Exercises 


Exercise 7.5.1. Use the Z-test for $p$ in Example 7.5.1 (assuming large $n$) instead of
the chi-square test, and show that it leads to the same P-value. 


Exercise 7.5.2. The grades assigned by a certain professor in several calculus classes 
to n = 420 students were distributed according to the following table: 


Points || (85,100] | (70,85] | (55,70] | (40,55] | [0,40]
Grade  || A        | B       | C       | D       | F
Freq.  || 40       | 96      | 138     | 99      | 47


Over the years, the calculus grades in the department have been normally
distributed with $\mu = 63$ and $\sigma = 18$. Use a chi-square test to determine whether the
professor's grades may be considered to be a random sample from the same
population.


Exercise 7.5.3. Explain why in the chi-square test for homogeneity, just as in the chi- 
square test for independence, the number of degrees of freedom is (a — 1)(b — 1), 
where now a is the number of subpopulations and b the number of categories. 


Exercise 7.5.4. Assume the same data as in Example 7.5.5 and set up a chi-square 
test to test the homogeneity of age distribution over party affiliation, that is, test 
whether this sample indicates the same age distribution in each party. What general 
conclusion can you draw from this example? 


Exercise 7.5.5. My calculator produced the following list of twenty random num- 
bers: 0.366, 0.428, 0.852, 0.602, 0.852, 0.598, 0.766, 0.627, 0.432, 0.939, 0.618, 
0.217, 0.002, 0.060, 0.391, 0.004, 0.099, 0.288, 0.630, 0.499. 

Does this sample support the hypothesis that the calculator generates random 
numbers from the uniform distribution (apart from rounding) over the interval [0, 1]? 


Exercise 7.5.6. In an office, the numbers of incoming phone calls in thirty ten-minute 
periods were observed to be 3, 2, 1, 0, 0, 1, 4, 0, 0, 1, 1, 1, 2, 3, 2, 0, 0, 1, 2, 1, 1, 1,
2, 2, 3, 3, 2, 1, 4, 1.

Does this sample support the hypothesis that the number of calls follows a Pois- 
son distribution? 




Exercise 7.5.7. The following table shows the numbers of students distributed ac- 
cording to grade and sex in some of my recent elementary statistics classes: 


Sex\Grade || A | B  | C | D  | F | P
M         || 5 | 6  | 4 | 4  | 7 | 5
F         || 9 | 11 | 9 | 11 | 9 | 8


Does this sample support the hypothesis that in such classes grades are indepen- 
dent of sex? 


7.6 Two-Sample Tests 


In many situations, we want to compare statistics gathered from two samples. For
example, in testing medications, patients are assigned at random to two groups whenever
possible: the treatment group, in which patients get the new drug to be tested,
and the control group, in which patients get no treatment. To avoid bias, the assignment
is usually double blind, that is, neither the patients nor the physicians know
who is in which group. To ensure this blindness, the patients in the control group are
given something like a sugar pill (called a placebo) that has no effect, and both the
patient and the administering physician are kept in the dark by an administrator who
keeps a secret record of who received a real pill and who received a fake one.

Other two-sample situations involve comparisons between analogous results,
such as exam scores, incomes, prices, various health statistics, etc. in different years,
or between different groups, such as men and women or Republicans and Democrats,
and so on.

Comparing the means of two independent normal or two arbitrary, large samples
is very easy:


Definition 7.6.1 (Two-Sample Z-Test). In this test we compare the unknown means
$\mu_1$ and $\mu_2$ of two populations, using two independent samples either (a) of arbitrary
sizes $n_1$ and $n_2$ from two normal distributions with known $\sigma_1$ and $\sigma_2$ or (b) of large
sizes $n_1$ and $n_2$ from any distributions so that $\overline{X}_1$ and $\overline{X}_2$ are nearly normal by the
CLT. The null hypothesis is $H_0$: $\mu_1 = \mu_2$, or equivalently, $\mu = \mu_1 - \mu_2 = 0$,
and we want to test against one of the alternative hypotheses $H_A$: $\mu > 0$, $\mu < 0$,
or $\mu \ne 0$. Thus, in these two cases, $\overline{X}_1 - \overline{X}_2$ is normal or may be taken as
normal. Hence, writing $\hat\sigma_1^2$ and $\hat\sigma_2^2$ for the observed sample variances, we use the test
statistics, standard normal under $H_0$,

\[ Z = \frac{\overline{X}_1 - \overline{X}_2}{\sigma_{\overline{X}}} \tag{7.129} \]

in case (a), and

\[ Z = \frac{\overline{X}_1 - \overline{X}_2}{\hat\sigma_{\overline{X}}} \tag{7.130} \]

in case (b), where

\[ \sigma_{\overline{X}} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \tag{7.131} \]

and

\[ \hat\sigma_{\overline{X}} = \sqrt{\frac{\hat\sigma_1^2}{n_1} + \frac{\hat\sigma_2^2}{n_2}}. \tag{7.132} \]

From this point on, we proceed exactly as in the one-sample Z-test.


We can verify that the distribution of the $Z$ above is standard normal in case (a)
and nearly so in case (b):

Under $H_0$, in case (a), $\overline{X}_1$ and $\overline{X}_2$ are independent sample means of samples of
sizes $n_1$ and $n_2$ from two normal populations with common mean $\mu$ and standard
deviations $\sigma_1$ and $\sigma_2$. Thus, the means of $\overline{X}_1$ and $\overline{X}_2$ are both the same $\mu$ and their
variances are $\sigma_1^2/n_1$ and $\sigma_2^2/n_2$. Hence, $\overline{X}_1 - \overline{X}_2$ is normal with mean 0 and standard
deviation $\sigma_{\overline{X}} = \sqrt{(\sigma_1^2/n_1) + (\sigma_2^2/n_2)}$, and so $Z = (\overline{X}_1 - \overline{X}_2)/\sigma_{\overline{X}}$ is standard
normal. In case (b), we just need to replace $\sigma_1$ and $\sigma_2$ by their estimates from the
samples.


Example 7.6.1 (Exam Scores of Men and Women). On a calculus test at a certain large
school, a random sample of 25 women had a mean score of 64 and SD of 14 and a
random sample of 25 men had a mean score of 60 and SD of 12. Can we conclude
that the women at this school do better in calculus?

We use a large-sample Z-test for the difference $\mu = \mu_1 - \mu_2$ of the two mean
scores, with $\mu_1$ denoting the women's mean score in the population and $\mu_2$ that of
the men. We take $H_0$: $\mu = 0$ and $H_A$: $\mu > 0$. The test statistic is $\overline{X}_1 - \overline{X}_2$, with
$\overline{X}_1$ denoting the women's mean score in the sample and $\overline{X}_2$ that of the men. The
rejection region is $\{\overline{x}_1 - \overline{x}_2 \ge 64 - 60\}$. We may assume that, under $H_0$, $\overline{X}_1 - \overline{X}_2$
is approximately normal with SD $\sqrt{(14^2 + 12^2)/25} \approx 3.7$. Thus, we obtain the P-value
as $P(\overline{X}_1 - \overline{X}_2 \ge 4 \mid H_0) = P((\overline{X}_1 - \overline{X}_2)/3.7 \ge 4/3.7) \approx 1 - \Phi(4/3.7) \approx 0.14$.
Consequently, we accept the null hypothesis that the men and women have the same
average score in the population; the discrepancy in the samples is probably just due
to chance caused by the random selection process.
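
The computation in Example 7.6.1 can be retraced with a few lines of Python. This is only a sketch: the function names `normal_cdf` and `two_sample_z` are ours, not the text's, and $\Phi$ is evaluated through the error function rather than read from Table 1.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal d.f. Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Return (z, right-tail P-value) for H0: mu1 = mu2 against HA: mu1 > mu2."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)  # Equation 7.132
    z = (mean1 - mean2) / se                   # Equation 7.130
    return z, 1.0 - normal_cdf(z)

# Women: mean 64, SD 14, n = 25; men: mean 60, SD 12, n = 25.
z, p_value = two_sample_z(64, 14, 25, 60, 12, 25)
print(f"z = {z:.2f}, P-value = {p_value:.2f}")  # z = 1.08, P-value = 0.14
```

For the left-tailed or two-tailed alternatives, the same `z` would be used with `normal_cdf(z)` or with twice the smaller tail, respectively.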


Example 7.6.2 (Osteoarthritis Treatment). D. O. Clegg et al.⁹ have studied the effects
of the popular supplements glucosamine, chondroitin sulfate, and the two in
combination for painful knee osteoarthritis. Among many other results, they found
that 188 of 313 randomly selected patients on placebo obtained at least 20% decrease
in their WOMAC pain scores and 211 of 317 randomly selected patients on the combined
supplements obtained a similar decrease. Do these results show a significant
effect of the supplements versus the placebo?


⁹ D. O. Clegg et al. Glucosamine, Chondroitin Sulfate, and the Two in Combination for
Painful Knee Osteoarthritis. NEJM. Feb. 2006.

Now, the sample proportions $\hat{P}_1$ and $\hat{P}_2$ of successful decreases are binomial
(divided by $n$) with estimated expected values $\hat{p}_1 = 188/313 \approx 0.601$ and
$\hat{p}_2 = 211/317 \approx 0.666$. We take $H_0$: $p_2 = p_1$ and $H_A$: $p_2 > p_1$. For the computation of $\hat\sigma$ we use
the pooled samples with $\hat{p} = 399/630$. Thus, under $H_0$, $\hat{P}_2 - \hat{P}_1$ is approximately
normal with mean $\mu = 0$ and

\[ \hat\sigma = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
   = \sqrt{\frac{399}{630}\left(1 - \frac{399}{630}\right)\left(\frac{1}{313} + \frac{1}{317}\right)} \approx 0.0384. \]

Hence, $P(\hat{P}_2 - \hat{P}_1 \ge 0.065) \approx 1 - \Phi(0.065/0.0384) \approx 1 - \Phi(1.693) \approx 0.045$. The
effect seems to be just barely significant.

Note, however, that the authors reported an unexplained P-value of 0.09 and drew 
the conclusion that the result was not significant. The discrepancy is probably due to 
their use of a two-tailed test, but that seems to be unwarranted, since we want to test 
the efficacy of the supplements, and not their absolute difference from the placebo. 
Among their other results, however, they reported a much more significant response 
to the combined therapy for patients with moderate-to-severe pain at baseline, than 
the numbers above for all patients. (See Exercise 7.6.4.) 

The very high placebo effect can probably be explained by the patients' use of
acetaminophen in addition to the experiment.
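
The pooled computation of Example 7.6.2 can be sketched in Python as follows. The function names are illustrative choices of ours, and $\Phi$ is computed from the error function instead of Table 1.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal d.f. Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pooled_two_proportion_z(successes1, n1, successes2, n2):
    """Return (z, right-tail P-value) for H0: p2 = p1 against HA: p2 > p1."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)   # pooled estimate of p under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return z, 1.0 - normal_cdf(z)

# Placebo: 188 of 313; combined supplements: 211 of 317.
z, p_one_tailed = pooled_two_proportion_z(188, 313, 211, 317)
p_two_tailed = 2 * p_one_tailed
print(f"z = {z:.3f}, one-tailed P = {p_one_tailed:.3f}, two-tailed P = {p_two_tailed:.3f}")
```

Doubling the one-tailed value gives roughly 0.09, which is consistent with the suggestion in the text that the authors' reported P-value came from a two-tailed test.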


The t-test can also be generalized for two independent samples: 


Definition 7.6.2 (Two-Sample t-Test). In this test we compare the unknown means
$\mu_1$ and $\mu_2$ of two normal populations with unknown common $\sigma = \sigma_1 = \sigma_2$, using
independent, small samples of sizes $n_1$ and $n_2$, respectively, from the two normal
distributions. Again, the test hypotheses are $H_0$: $\mu_1 = \mu_2$, or equivalently,
$\mu = \mu_1 - \mu_2 = 0$, and one of the alternatives $H_A$: $\mu > 0$, $\mu < 0$, or $\mu \ne 0$. Under $H_0$,
with

\[ S_{\overline{X}} = \sqrt{\left(\frac{\hat\Sigma_1^2}{n_2} + \frac{\hat\Sigma_2^2}{n_1}\right)\frac{n_1 + n_2}{n_1 + n_2 - 2}}, \tag{7.133} \]

the test statistic

\[ T = \frac{\overline{X}_1 - \overline{X}_2}{S_{\overline{X}}} \tag{7.134} \]

has a t-distribution with $n_1 + n_2 - 2$ degrees of freedom.


We consider this test only under the assumption $\sigma_1 = \sigma_2$, which is very reasonable
in many applications. For instance, if the two samples are taken from a treatment
group and a control group, then the assumption underlying $H_0$, that the treatment has
no effect, would imply that the two populations have the same characteristics, and so
not only their means but also their variances are equal. The case $\sigma_1 \ne \sigma_2$ is discussed
in more advanced texts.

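We can verify the distribution of the $T$ above as follows:

Under $H_0$, $\overline{X}_1$ and $\overline{X}_2$ are sample means of samples of sizes $n_1$ and $n_2$ from a
normal population with mean $\mu$ and standard deviation $\sigma$. Thus, their means are the
same $\mu$ and their variances are $\sigma^2/n_1$ and $\sigma^2/n_2$. Hence, $\overline{X}_1 - \overline{X}_2$ is normal with
standard deviation $\sigma_{\overline{X}} = \sqrt{(\sigma^2/n_1) + (\sigma^2/n_2)}$, and $Z = (\overline{X}_1 - \overline{X}_2)/\sigma_{\overline{X}}$ is standard
normal.

By Theorem 7.4.2,

\[ \frac{n_1\hat\Sigma_1^2}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^{n_1}(X_{1i} - \overline{X}_1)^2
   \quad\text{and}\quad
   \frac{n_2\hat\Sigma_2^2}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^{n_2}(X_{2i} - \overline{X}_2)^2 \tag{7.135} \]

are chi-square random variables with $n_1 - 1$ and $n_2 - 1$ degrees of freedom, respectively.
Thus,

\[ V = \frac{n_1\hat\Sigma_1^2}{\sigma^2} + \frac{n_2\hat\Sigma_2^2}{\sigma^2} \tag{7.136} \]

is chi-square with $n_1 + n_2 - 2$ degrees of freedom.
Hence, by Theorem 7.4.5,

\[ U = Z\sqrt{\frac{n_1 + n_2 - 2}{V}} \tag{7.137} \]

has a t-distribution with $n_1 + n_2 - 2$ degrees of freedom. Now, we show that this $U$
is the same as the $T$ in Equation 7.134.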
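Indeed, the variances of $\overline{X}_1$ and $\overline{X}_2$ are $\sigma^2/n_1$ and $\sigma^2/n_2$, so that
$\sigma_{\overline{X}} = \sigma\sqrt{(n_1 + n_2)/(n_1 n_2)}$, and

\[ V = \frac{n_1\hat\Sigma_1^2 + n_2\hat\Sigma_2^2}{\sigma^2}. \tag{7.138} \]

Thus,

\begin{align*}
U &= \frac{\overline{X}_1 - \overline{X}_2}{\sigma_{\overline{X}}}\sqrt{\frac{n_1 + n_2 - 2}{V}}
   = \frac{\overline{X}_1 - \overline{X}_2}{\sigma\sqrt{\dfrac{n_1 + n_2}{n_1 n_2}}}
     \sqrt{\frac{\sigma^2 (n_1 + n_2 - 2)}{n_1\hat\Sigma_1^2 + n_2\hat\Sigma_2^2}} \\
  &= (\overline{X}_1 - \overline{X}_2)\sqrt{\frac{n_1 n_2}{n_1\hat\Sigma_1^2 + n_2\hat\Sigma_2^2}}
     \sqrt{\frac{n_1 + n_2 - 2}{n_1 + n_2}} \\
  &= \frac{\overline{X}_1 - \overline{X}_2}
          {\sqrt{\dfrac{\hat\Sigma_1^2}{n_2} + \dfrac{\hat\Sigma_2^2}{n_1}}\,
           \sqrt{\dfrac{n_1 + n_2}{n_1 + n_2 - 2}}}
   = T, \tag{7.139}
\end{align*}

as was to be shown.

Example 7.6.3 (Cure for Stuttering in Children). Mark Jones et al.¹⁰ conducted an
experiment in which they compared a treatment, called the Lidcombe Programme,
with “no treatment” as control. Here we describe a much abbreviated version of the
experiment and the results.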

“The children allocated to the Lidcombe programme arm received the treatment
according to the programme manual. Throughout the programme, parents provide
verbal contingencies for periods of stutter free speech and for moments of stuttering.
This occurs in conversational exchanges with the child in the child’s natural
environment. The contingencies for stutter free speech are acknowledgment (“that was
smooth”), praise (“that was good talking”), and request for self evaluation (“were
there any bumpy words then?”). The contingencies for unambiguous stuttering are
acknowledgment (“that was a bit bumpy”) and request for self correction (“can you
say that again?”). The programme is conducted under the guidance of a speech
pathologist. During the first stage of the programme, a parent conducts the treatment
for prescribed periods each day, and parent and child visit the speech pathologist
once a week. The second stage starts when stuttering has been maintained at a
frequency of less than 1.0% of syllables stuttered over three consecutive weeks inside
and outside the clinic and is designed to maintain those low levels.”

The authors measured the severity of stuttering (% of syllables stuttered) before
randomization (that is, the random assignment of children to the two groups) and after
nine months. They assumed that the hypothetical populations corresponding to the
two groups were normal and independent, and consequently, they used a two-sample
t-test. They obtained the following means and SD's (the latter in parentheses):


               | Treatment | Control
n              | 27        | 20
Before         | 6.4 (4.3) | 6.8 (4.9)
At nine months | 1.5 (1.4) | 3.9 (3.5)


At nine months, from Equation 7.132, $\hat\sigma_{\overline{X}_2 - \overline{X}_1} = \sqrt{(1.4^2/27) + (3.5^2/20)} \approx 0.8$.
Thus, a 95% confidence interval for the difference $\delta = \mu_2 - \mu_1$ between the
two populations in average % of syllables stuttered at nine months is approximately
$2.4 \pm 1.6$.

Apparently, the authors made no use of the “before randomization” figures. They 
should have compared the improvements of the two groups: 6.4 — 1.5 = 4.9 to 
6.8 — 3.9 = 2.9, rather than just the end results. We cannot do this comparison 
from the data presented, because we have no way of knowing the SD’s of these 
differences. The “before” and “after” figures are not independent, since they refer to 
the same children, and so we cannot use Equation 7.132 to compute the SD’s of the 
improvements. The only way these SD’s could have been obtained, would have been 
to note the improvement of each child, and to compute the SD’s from those. 


¹⁰ Randomised controlled trial of the Lidcombe programme of early stuttering intervention.
Mark Jones et al., BMJ Sep. 2005.

To test the significance of the nine months results, the authors considered $H_0$:
$\delta = 0$ and $H_A$: $\delta > 0$. Under $H_0$ the t-value for the difference is about
$2.4/[0.8 \cdot \sqrt{(27 + 20)/(27 + 20 - 2)}] \approx 2.9$ with $n_1 + n_2 - 2 = 45$ degrees of
freedom. By statistical software, $P(T > 2.9) \approx 0.003$. This result is highly significant:
the treatment is effective.
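
As a cross-check, the pooled statistic of Equations 7.133 and 7.134 can also be evaluated directly from the summary data. The sketch below rests on assumptions that the text does not confirm: the function name is ours, and the reported SD's are treated as the sample SD's with divisor $n$.

```python
import math

def pooled_t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, degrees of freedom) for the two-sample t-test with common sigma.

    Assumption: sd1 and sd2 are sample SDs computed with divisor n, so that
    n1*sd1**2 + n2*sd2**2 is the pooled sum of squared deviations.
    """
    df = n1 + n2 - 2
    pooled_var = (n1 * sd1**2 + n2 * sd2**2) / df      # estimate of the common sigma^2
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))     # estimated SD of the difference
    return (mean1 - mean2) / se, df

# Control (population 2 in the text) minus treatment, at nine months.
t, df = pooled_t_statistic(3.9, 3.5, 20, 1.5, 1.4, 27)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

Under these assumptions the statistic comes out somewhat larger than the rounded value 2.9 obtained above from the shortcut through Equation 7.132, because the pooled formula weights the two sample variances differently. The conclusion is the same either way.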


Next, we present a test for comparing the variances of two independent normal
populations. Since the normalized sample variances from Equation 7.135 are chi-square,
with $n_1 - 1$ and $n_2 - 1$ degrees of freedom, respectively, it is customary to
compare the unbiased sample variances

\[ \hat{V}_1 = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1}(X_{1i} - \overline{X}_1)^2
   \quad\text{and}\quad
   \hat{V}_2 = \frac{1}{n_2 - 1}\sum_{i=1}^{n_2}(X_{2i} - \overline{X}_2)^2 \tag{7.140} \]

to each other. For this comparison we use their ratio, rather than their difference,
because when $\sigma_1 = \sigma_2 = \sigma$, the sampling distribution of the difference depends on
$\sigma$, but that of the ratio does not.

Such a ratio has a special distribution, whose name is somewhat unfortunate because
it conflicts with the notation for d.f.'s. It was so named in honor of its discoverer,
Ronald A. Fisher.


Definition 7.6.3 (F-Distributions). Let $\chi_m^2$ and $\chi_n^2$ be independent chi-square random
variables with $m$ and $n$ degrees of freedom, respectively. Then

\[ F_{m,n} = \frac{\chi_m^2/m}{\chi_n^2/n} = \frac{n\chi_m^2}{m\chi_n^2} \tag{7.141} \]

is said to have an F-distribution with $m$ and $n$ degrees of freedom.


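Theorem 7.6.1 (Density of F-Distributions). The density of the $F_{m,n}$ above is given
by

\[ f(x) = \begin{cases} 0 & \text{if } x \le 0 \\[4pt]
   c\,\dfrac{x^{(m/2)-1}}{(mx + n)^{(m+n)/2}} & \text{if } x > 0, \end{cases} \tag{7.142} \]

where

\[ c = \frac{\Gamma\!\left(\frac{m+n}{2}\right) m^{m/2} n^{n/2}}{\Gamma\!\left(\frac{m}{2}\right)\Gamma\!\left(\frac{n}{2}\right)}. \tag{7.143} \]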


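Proof. Theorem 6.120 gives the density of a chi-square variable as

\[ f_{\chi_n^2}(x) = \begin{cases} 0 & \text{if } x \le 0 \\[4pt]
   \dfrac{1}{2^{n/2}\Gamma(n/2)}\, x^{(n/2)-1} e^{-x/2} & \text{if } x > 0. \end{cases} \tag{7.144} \]

Hence, by Example 4.3.1, the density of $\chi_n^2/n$ is

\[ f_{\chi_n^2/n}(x) = \begin{cases} 0 & \text{if } x \le 0 \\[4pt]
   \dfrac{n}{2^{n/2}\Gamma(n/2)}\, (nx)^{(n/2)-1} e^{-(nx)/2} & \text{if } x > 0. \end{cases} \tag{7.145} \]

Now, we apply Theorem 7.4.3 to the ratio of the two scaled chi-square random
variables in Definition 7.6.3, with densities as given in Equation 7.145: For $x > 0$,

\begin{align*}
f(x) &= \int_0^\infty y\, f_{\chi_m^2/m}(xy)\, f_{\chi_n^2/n}(y)\, dy \\
 &= \int_0^\infty y\, \frac{m}{2^{m/2}\Gamma(m/2)} (mxy)^{(m/2)-1} e^{-(mxy)/2}
    \cdot \frac{n}{2^{n/2}\Gamma(n/2)} (ny)^{(n/2)-1} e^{-(ny)/2}\, dy \\
 &= \frac{m^{m/2} n^{n/2} x^{(m/2)-1}}{2^{(m+n)/2}\Gamma(m/2)\Gamma(n/2)}
    \int_0^\infty y^{[(m+n)/2]-1} e^{-[(mx+n)y]/2}\, dy. \tag{7.146}
\end{align*}

If we change the variable $y$ to $u = [(mx + n)y]/2$ in the last integral, then we get

\begin{align*}
f(x) &= \frac{m^{m/2} n^{n/2} x^{(m/2)-1}}{2^{(m+n)/2}\Gamma(m/2)\Gamma(n/2)}
        \cdot \frac{2^{(m+n)/2}}{(mx + n)^{(m+n)/2}} \int_0^\infty u^{[(m+n)/2]-1} e^{-u}\, du \\
 &= \frac{m^{m/2} n^{n/2} x^{(m/2)-1}\, \Gamma\!\left(\frac{m+n}{2}\right)}
         {\Gamma(m/2)\Gamma(n/2)\,(mx + n)^{(m+n)/2}}. \tag{7.147}
\end{align*}
□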


Note that the explicit expression for the F-density is not very useful. We obtain
associated probabilities from tables or by computer.


Definition 7.6.4 (F-Test). We use this test for comparing the standard deviations $\sigma_1$
and $\sigma_2$ of two normal populations with unknown $\sigma_1$ and $\sigma_2$ and arbitrary $\mu_1$ and $\mu_2$.
We take two independent samples of arbitrary sizes $n_1$ and $n_2$, respectively, from
the two populations. The null hypothesis is $H_0$: $\sigma_1 = \sigma_2$, and we test against one
of the alternative hypotheses $H_A$: $\sigma_1 < \sigma_2$, $\sigma_1 > \sigma_2$, or $\sigma_1 \ne \sigma_2$. We consider the
test statistic $\hat{V}_1/\hat{V}_2$ (see Equation 7.140), which has an F-distribution with $n_1 - 1$ and
$n_2 - 1$ degrees of freedom under $H_0$, because $(n_i - 1)\hat{V}_i/\sigma_i^2$ is chi-square with $n_i - 1$
degrees of freedom.


Let us mention that this test is very sensitive to deviations from normality, and if
the populations are not very close to normal, then other tests must be used.

Example 7.6.4 (Oxygen in Wastewater). Miller and Miller¹¹ discuss the following
example. A proposed method for the determination of the chemical oxygen demand
of wastewater is compared with the accepted mercury salt method. The measurements
are assumed to come from independent normal populations. The new method
is considered to be better than the old one, if its SD is smaller than the SD of the old
method.

The following results were obtained: 


                   | $\hat\mu$ (mg/L) | $\hat{V}$ (mg/L)
1. Standard Method | 72              | 10.96
2. Proposed Method | 72              | 2.28


Thus, we use the test statistic $F_{5,7}$, which now has the value $10.96/2.28 \approx 4.8$.
The null hypothesis is $H_0$: $\sigma_1 = \sigma_2$, and the alternative is $H_A$: $\sigma_1 > \sigma_2$. Thus, the
P-value is the probability of the right tail. Statistical software gives $P(F_{5,7} > 4.8) \approx 0.03$.
This result is significant, that is, we accept $H_A$ that the new method is better.
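
Although the explicit F-density is rarely used by hand, it is easy to integrate numerically. The following sketch evaluates the density of Theorem 7.6.1 and recovers the right-tail probability of Example 7.6.4 by Simpson's rule. The function names and the choice of integration rule are ours, and the routine assumes $m > 2$, so that the density stays bounded near 0.

```python
import math

def f_density(x: float, m: int, n: int) -> float:
    """Density of F_{m,n} from Equations 7.142 and 7.143 (0 for x <= 0)."""
    if x <= 0:
        return 0.0
    # Work with logarithms so that the gamma functions cannot overflow.
    log_c = (math.lgamma((m + n) / 2) - math.lgamma(m / 2) - math.lgamma(n / 2)
             + (m / 2) * math.log(m) + (n / 2) * math.log(n))
    log_f = log_c + (m / 2 - 1) * math.log(x) - ((m + n) / 2) * math.log(m * x + n)
    return math.exp(log_f)

def f_right_tail(x0: float, m: int, n: int, steps: int = 20000) -> float:
    """P(F_{m,n} > x0) as 1 minus the Simpson's-rule integral of the density on [0, x0].

    Assumes m > 2 and an even number of steps.
    """
    h = x0 / steps
    total = f_density(0.0, m, n) + f_density(x0, m, n)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f_density(k * h, m, n)
    return 1.0 - total * h / 3

p_value = f_right_tail(10.96 / 2.28, 5, 7)
print(f"P(F_5,7 > {10.96 / 2.28:.2f}) is about {p_value:.3f}")
```

The result agrees with the value of about 0.03 quoted from statistical software.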


Exercises 


Exercise 7.6.1. St. John's wort extract (hypericum) is a popular herbal supplement
for the treatment of depression. Researchers in Germany conducted an experiment,
in which they showed that it compares favorably with a standard drug called
paroxetine.¹² Among other results, they found the following mean decreases on the
Montgomery-Åsberg depression rating scale, from baseline to day 42:


Hypericum | Paroxetine 
n 122 122 
mean (SD) | 16.4 (10.7) | 12.6 (10.6) 


Find the P-value of a two-sample z-test, to show that the superior efficacy of St. 
John’s wort extract is highly significant. 


Exercise 7.6.2. Show that a random variable $X$ has a t distribution with $n$ degrees of
freedom if and only if $X^2$ has an F distribution with 1 and $n$ degrees of freedom.


Exercise 7.6.3. Prove that $E(F_{m,n}) = n/(n - 2)$ if $n > 2$. Hint: Use the independence
of the chi-square variables in the definition of $F_{m,n}$.


¹¹ Statistics for Analytical Chemistry, J. C. Miller and J. N. Miller.

¹² Acute treatment of moderate to severe depression with hypericum extract WS 5570 (St
John's wort): randomised controlled double blind non-inferiority trial versus paroxetine.
A. Szegedi, R. Kohnen, A. Dienel, M. Kieser, BMJ, March 2005.

Exercise 7.6.4. D. O. Clegg et al. (Footnote 9) reported a highly significant
response to the combined therapy with glucosamine and chondroitin sulfate for
patients with moderate-to-severe pain at baseline. They found that 38 of 70 such,
randomly selected patients on placebo obtained at least 20% decrease in their WOMAC
pain scores and 57 of 72 such, randomly selected patients on the combined supplements
obtained a similar decrease. Find the P-value of the effect of the supplements
versus the placebo.


7.7 Kolmogorov—Smirnov Tests 


In the 1930s two Russian mathematicians A. N. Kolmogorov and N. V. Smirnov 
developed several goodness of fit tests, two of which we are going to describe here. 
The first of these tests is designed to determine whether sample data come from a 
given distribution, and the second test, whether data of two samples come from the 
same distribution or not. These are instances of nonparametric tests. 

These tests use a distribution function constructed from sample data: 


Definition 7.7.1 (Empirical or Sample Distribution Function). Let $x_1, x_2, \ldots, x_n$
be arbitrary real numbers, and assign probability $1/n$ to each of these numbers. The
distribution function corresponding to this probability distribution is called the
corresponding empirical or sample distribution function $F_n(x)$. In other words,

\[ F_n(x) = \frac{1}{n} \cdot (\text{number of } x_i \le x). \tag{7.148} \]

Clearly, $F_n(x)$ is a step function, increasing from 0 to 1, and is continuous from
the right. If the $x_i$ values are distinct, then $F_n(x)$ has a jump of size $1/n$ at each of
the $x_i$ values. If the $x_i$ are sample values from a population with continuous $F$, then
they should be distinct, although in practice they may not be, because of rounding.
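
Equation 7.148 translates almost literally into code. In the sketch below the function name `empirical_df` and the four sample values are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right

def empirical_df(sample):
    """Return F_n as a function, with F_n(x) = (number of x_i <= x) / n."""
    ordered = sorted(sample)
    n = len(ordered)

    def F_n(x):
        # bisect_right counts the sample values that are <= x.
        return bisect_right(ordered, x) / n

    return F_n

F4 = empirical_df([2.5, 0.7, 4.1, 1.9])  # hypothetical sample of size 4
print([F4(x) for x in (0.0, 0.7, 2.0, 4.1, 5.0)])  # [0.0, 0.25, 0.5, 1.0, 1.0]
```

The value at 0.7 already includes the jump at that sample point, which reflects the continuity from the right noted above.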


Example 7.7.1 (An Empirical Distribution Function). The graph in Figure 7.5 shows
the empirical distribution function for a sample of size $n = 4$ with distinct $x_i$ values.
It has jumps of size 1/4 at the sample values $x_1, x_2, x_3, x_4$.


The tests of Kolmogorov and Smirnov use the following quantity as a test statistic:


Definition 7.7.2 (Kolmogorov-Smirnov Distance). The Kolmogorov-Smirnov (K-S)
distance of two distribution functions $F$ and $G$ is defined as the quantity

\[ d = \sup_x |F(x) - G(x)|. \tag{7.149} \]

(See Figure 7.6.)

Fig. 7.5.


Lemma 7.7.1 (Alternative Expression for a K-S Distance). If $F$ is a continuous
d.f. and $F_n$ an empirical d.f. for distinct $x_i$ values, then the K-S distance of $F$ and
$F_n$ is given by

\[ d_n = \max_{1 \le i \le n}\left(\max\left\{F_n(x_i) - F(x_i),\; F(x_i) - F_n(x_i) + \frac{1}{n}\right\}\right). \tag{7.150} \]

Proof. Assume that the $x_i$ values are in increasing order and let $x_0 = -\infty$.

Clearly, $\sup_x |F(x) - F_n(x)|$ must be attained at one of the $x_i$ values. For,
on an interval $[x_{i-1}, x_i)$ or $(-\infty, x_1)$, where $F_n(x)$ is constant, $F(x) - F_n(x)$
is increasing (together with $F(x)$), and so its supremum is reached at the right
endpoint $x_i$ of the interval, and its minimum at the left endpoint $x_{i-1}$ of the interval.
Thus, $\sup_{x_{i-1} \le x < x_i} |F(x) - F_n(x)|$ is the larger of the vertical distances
$|F(x_{i-1}) - F_n(x_{i-1})|$ and $|F(x_i) - F_n(x_{i-1})|$, where $F_n(x_{i-1}) = F_n(x_i) - 1/n$ is the
value of $F_n$ just to the left of $x_i$, and so $\sup_x |F(x) - F_n(x)|$ is the largest
of the $2n$ such distances. This maximum distance can also be found by first finding
the larger of the two distances at each $x_i$, that is, the larger of the vertical distances
from the graph of $F$ to the two corners of the graph of $F_n$ at $x_i$, and then finding
the maximum of those as $i$ varies from 1 to $n$. This procedure can be done without
absolute values as stated in the lemma. □


Definition 7.7.3 (One-Sample Kolmogorov-Smirnov Test). Suppose we want to
test whether a certain random variable $X$ has a given continuous d.f. $F$ (the null
hypothesis) or not (the alternative hypothesis).

Consider a random sample of $X$ with distinct observed values $x_1, x_2, \ldots, x_n$.
Construct the corresponding empirical d.f. $F_n$, and find the K-S distance $d_n$ between
$F$ and $F_n$. Tables and software are available for the null distribution of the random
variable

\[ D_n = \max_{1 \le i \le n}\left(\max\left\{F_n(X_i) - F(X_i),\; F(X_i) - F_n(X_i) + \frac{1}{n}\right\}\right). \tag{7.151} \]

For small samples, use one of those to find the P-value $P(D_n \ge d_n)$. For large $n$, use
the formula

\[ P\left(D_n \ge \frac{c}{\sqrt{n}}\right) \approx 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2c^2}. \tag{7.152} \]
Reject the null hypothesis if P is small and accept it otherwise. 
Remarks. 


1. The null distribution of $D_n$ does not depend on $F$, that is, a single table works
for any continuous $F$.

2. For discrete random variables, the P-values given in the table are only upper
bounds, that is, the true P-value can be much smaller than the one obtained from
the table. Thus, the K-S test can also be used for discrete random variables if it
leads to the rejection of $H_0$.

3. $F$ must be fully specified. If parameters are estimated from the data, then the
test is only approximate. Separate tables have been obtained by simulation for
the most important parametric families of distributions to deal with this problem;
we do not discuss them.

4. The test is more sensitive to data at the center of the distribution than at the tails.
Various modifications have been developed to correct for this problem; we do
not discuss them.

5. For small samples, the test has low power, that is, it is prone to type 2 errors: it
accepts the null hypothesis too easily when it should not.

Example 7.7.2 (Are Grades Normal?). In a small class, the grades on a calculus exam
were 12, 19, 22, 43, 52, 56, 68, 76, 88, 95. Do they come from a normal distribution?

First, we compute $\mu$ and $\sigma$ of the data, and then we use a K-S test as follows: We
find $\mu \approx 53$ and $\sigma \approx 29$ and we take $F$ to be the normal d.f. with these parameters.
Also, $n = 10$, and so $F_n$ is a step function with jumps of size 1/10. Below, we
tabulate the values of $F(x_i)$, $F_n(x_i)$, $d_i^+ = F_n(x_i) - F(x_i)$, and
$d_i^- = F(x_i) - F_n(x_i) + (1/n) = (1/n) - d_i^+$, for each grade $x_i$:


$x_i$      | 12    | 19    | 22     | 43    | 52    | 56    | 68    | 76    | 88    | 95
$F(x_i)$   | 0.079 | 0.121 | 0.143  | 0.365 | 0.486 | 0.541 | 0.698 | 0.786 | 0.886 | 0.926
$F_n(x_i)$ | 0.1   | 0.2   | 0.3    | 0.4   | 0.5   | 0.6   | 0.7   | 0.8   | 0.9   | 1
$d_i^+$    | 0.021 | 0.079 | 0.157  | 0.035 | 0.014 | 0.059 | 0.002 | 0.014 | 0.014 | 0.074
$d_i^-$    | 0.079 | 0.021 | -0.057 | 0.065 | 0.086 | 0.041 | 0.098 | 0.086 | 0.086 | 0.026


Hence $d_n \approx 0.157$. In the table, the entry for $n = 10$ under $P = 0.20$ is 0.322.
A smaller $d_n$ supports $H_0$ more strongly. Thus, 0.157 would fall under a P-value
considerably higher than 0.20. So, we accept the null hypothesis: the grades may
well come from a normal distribution.
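
The table of Example 7.7.2 can be reproduced with a short program based on Lemma 7.7.1. This is a sketch under the same rounded parameters $\mu = 53$ and $\sigma = 29$; the function names are ours, and the normal d.f. is computed from the error function.

```python
import math

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Normal d.f. with mean mu and SD sigma, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_distance(sample, F) -> float:
    """K-S distance d_n of Equation 7.150; assumes distinct values and continuous F."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        d_plus = i / n - F(x)          # F_n(x_i) - F(x_i)
        d_minus = F(x) - (i - 1) / n   # F(x_i) - F_n(x_i) + 1/n
        d = max(d, d_plus, d_minus)
    return d

grades = [12, 19, 22, 43, 52, 56, 68, 76, 88, 95]
d_n = ks_distance(grades, lambda x: normal_cdf(x, 53, 29))
print(f"d_n is about {d_n:.3f}")
```

The maximum is reached at the grade 22, in agreement with the $d_i^+$ row of the table.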


Definition 7.7.4 (Two-Sample Kolmogorov-Smirnov Test). Suppose we want to
test whether two independent random samples of sizes $m$ and $n$, respectively, have
the same continuous d.f. (the null hypothesis) or not (the alternative hypothesis).

Let $F_m(x)$ and $G_n(x)$ denote the empirical distribution functions of the two samples.
Compute their K-S distance

\[ d_{mn} = \sup_x |F_m(x) - G_n(x)|. \tag{7.153} \]

We have tables and software for the null-distribution of the corresponding test statistic
$D_{mn}$. For small samples, use one of those to find the P-value $P(D_{mn} \ge d_{mn})$. For
large samples, use the formula

\[ P\left(D_{mn} \ge c\sqrt{\frac{m + n}{mn}}\right) \approx 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2c^2}. \tag{7.154} \]

Reject the null hypothesis if P is small and accept it otherwise. 
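
The large-sample formulas 7.152 and 7.154 share the same alternating series, so one helper serves both tests. The sketch below uses hypothetical function names and truncates the series at 100 terms, which is ample unless $c$ is close to 0.

```python
import math

def ks_series_tail(c: float, terms: int = 100) -> float:
    """Right side of Equations 7.152 and 7.154: 2 * sum of (-1)^(k-1) exp(-2 k^2 c^2)."""
    return 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * c * c)
                     for k in range(1, terms + 1))

def one_sample_p(d_n: float, n: int) -> float:
    """Approximate P(D_n >= d_n) for large n, using d_n = c / sqrt(n)."""
    return ks_series_tail(d_n * math.sqrt(n))

def two_sample_p(d_mn: float, m: int, n: int) -> float:
    """Approximate P(D_mn >= d_mn) for large m and n, using d_mn = c * sqrt((m+n)/(mn))."""
    return ks_series_tail(d_mn * math.sqrt(m * n / (m + n)))

print(f"{ks_series_tail(1.36):.4f}")      # close to 0.05
print(f"{one_sample_p(0.10, 150):.4f}")   # hypothetical d_n = 0.10 with n = 150
```

The first line shows that $c = 1.36$ corresponds to a P-value of about 0.05, which matches the large-sample entry $1.36/\sqrt{n}$ at the 0.05 level in Table 4.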


Example 7.7.3 (Grades of Men and Women). Suppose that on an exam in a large
statistics class, the grades of $m = 5$ randomly selected men were 25, 36, 58, 79,
96 and the grades of $n = 6$ randomly selected women were 32, 44, 51, 66, 89, 93.
Use the two-sample K-S test to determine whether the two sets come from the same
distribution.

We want to use the result of Exercise 7.7.3 to compute $d_{mn}$. We list the necessary
quantities in the following tables, first for the men's grades and then for the women's:


$z_i$                             | 25   | 36   | 58   | 79   | 96
$F_m(z_i)$                        | 1/5  | 2/5  | 3/5  | 4/5  | 1
$G_n(z_i)$                        | 0    | 1/6  | 3/6  | 4/6  | 1
$\lvert F_m(z_i) - G_n(z_i)\rvert$ | 6/30 | 7/30 | 3/30 | 4/30 | 0/30


$z_i$                             | 32   | 44   | 51   | 66   | 89   | 93
$F_m(z_i)$                        | 1/5  | 2/5  | 2/5  | 3/5  | 4/5  | 4/5
$G_n(z_i)$                        | 1/6  | 2/6  | 3/6  | 4/6  | 5/6  | 1
$\lvert F_m(z_i) - G_n(z_i)\rvert$ | 1/30 | 2/30 | 3/30 | 2/30 | 1/30 | 6/30


Hence, $d_{mn} = 7/30$. The critical value at $m = 5$ and $n = 6$ in the two-sample
K-S table for $\alpha = 0.05$ is 20/30. Since $d_{mn}$ is less than this, we accept the null
hypothesis, that the men and women have the same grade distribution, at the 5%
level. In fact, the P-value is apparently much higher than 0.05. (In general, a small
$d_{mn}$ value supports the null hypothesis, while a high one supports the alternative.)
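
The two tables above amount to evaluating both empirical d.f.'s at every observed value and taking the largest difference, as Exercise 7.7.3 asserts. A minimal sketch, with a function name of our own choosing and exact fractions to avoid rounding:

```python
from bisect import bisect_right
from fractions import Fraction

def two_sample_ks_distance(xs, ys) -> Fraction:
    """K-S distance of the two empirical d.f.'s, evaluated at the pooled sample values."""
    xs, ys = sorted(xs), sorted(ys)
    m, n = len(xs), len(ys)

    def F_m(z):
        return Fraction(bisect_right(xs, z), m)   # share of the first sample <= z

    def G_n(z):
        return Fraction(bisect_right(ys, z), n)   # share of the second sample <= z

    return max(abs(F_m(z) - G_n(z)) for z in xs + ys)

men = [25, 36, 58, 79, 96]
women = [32, 44, 51, 66, 89, 93]
print(two_sample_ks_distance(men, women))  # 7/30
```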


Exercises 


Exercise 7.7.1. Use the one-sample K-S test to determine whether a sample of size
$n = 300$ comes from a population with a given continuous d.f. $F$, if $d_n = 0.06$.


Exercise 7.7.2. Suppose the grades in a class were 20, 70, 20, 40, 70, 50, 50, 70, 80,
80. Find and plot the empirical d.f. of this sample.


Exercise 7.7.3. Let $x_1, x_2, \ldots, x_m$ and $y_1, y_2, \ldots, y_n$ be the observed values of
two samples. Let $\{z_1, z_2, \ldots, z_{m+n}\} = \{x_1, x_2, \ldots, x_m\} \cup \{y_1, y_2, \ldots, y_n\}$. Prove that
$d_{mn} = \max_i |F_m(z_i) - G_n(z_i)|$.


Exercise 7.7.4. In Exercise 7.5.5 twenty random numbers from a calculator were
given and the chi-square test was used to decide whether the calculator generates
random numbers from the uniform distribution (apart from rounding) over the interval
[0, 1]. Answer the same question using the K-S test.


Exercise 7.7.5. Suppose we have two samples of sizes $m = 200$ and $n = 300$,
respectively, and we find $d_{mn} = 0.08$. Use the two-sample K-S test to decide whether
to accept $H_0$ that they come from the same population.


Exercise 7.7.6. Suppose the grades in samples other than in Example 7.7.3 were
found to be 25, 28, 39, 52, 75, 96 for the men and 38, 44, 51, 66, 89, 93, 98 for the
women. Use the two-sample K-S test to determine whether the two sets come from
the same distribution.

Table 1. Standard normal d.f.

\[ \Phi(z) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z} e^{-t^2/2}\,dt = P(Z \le z) \]

z   | 0      | 2      | 3      | 4      | 5      | 6      | 7      | 8      | 9
0.0 | 0.5000 | 0.5080 | 0.5120 | 0.5160 | 0.5199 | 0.5239 | 0.5279 | 0.5319 | 0.5359
0.1 | 0.5398 | 0.5478 | 0.5517 | 0.5557 | 0.5596 | 0.5636 | 0.5675 | 0.5714 | 0.5753
0.2 | 0.5793 | 0.5871 | 0.5910 | 0.5948 | 0.5987 | 0.6026 | 0.6064 | 0.6103 | 0.6141
0.3 | 0.6179 | 0.6255 | 0.6293 | 0.6331 | 0.6368 | 0.6406 | 0.6443 | 0.6480 | 0.6517
0.4 | 0.6554 | 0.6628 | 0.6664 | 0.6700 | 0.6736 | 0.6772 | 0.6808 | 0.6844 | 0.6879
0.5 | 0.6915 | 0.6985 | 0.7019 | 0.7054 | 0.7088 | 0.7123 | 0.7157 | 0.7190 | 0.7224
0.6 | 0.7257 | 0.7324 | 0.7357 | 0.7389 | 0.7422 | 0.7454 | 0.7486 | 0.7517 | 0.7549
0.7 | 0.7580 | 0.7642 | 0.7673 | 0.7703 | 0.7734 | 0.7764 | 0.7794 | 0.7823 | 0.7852
0.8 | 0.7881 | 0.7939 | 0.7967 | 0.7995 | 0.8023 | 0.8051 | 0.8078 | 0.8106 | 0.8133
0.9 | 0.8159 | 0.8212 | 0.8238 | 0.8264 | 0.8289 | 0.8315 | 0.8340 | 0.8365 | 0.8389
1.0 | 0.8413 | 0.8461 | 0.8485 | 0.8508 | 0.8531 | 0.8554 | 0.8577 | 0.8599 | 0.8621
1.1 | 0.8643 | 0.8686 | 0.8708 | 0.8729 | 0.8749 | 0.8770 | 0.8790 | 0.8810 | 0.8830
1.2 | 0.8849 | 0.8888 | 0.8907 | 0.8925 | 0.8944 | 0.8962 | 0.8980 | 0.8997 | 0.9015
1.3 | 0.9032 | 0.9066 | 0.9082 | 0.9099 | 0.9115 | 0.9131 | 0.9147 | 0.9162 | 0.9177
1.4 | 0.9192 | 0.9222 | 0.9236 | 0.9251 | 0.9265 | 0.9278 | 0.9292 | 0.9306 | 0.9319
1.5 | 0.9332 | 0.9357 | 0.9370 | 0.9382 | 0.9394 | 0.9406 | 0.9418 | 0.9430 | 0.9441
1.6 | 0.9452 | 0.9474 | 0.9484 | 0.9495 | 0.9505 | 0.9515 | 0.9525 | 0.9535 | 0.9545
1.7 | 0.9554 | 0.9573 | 0.9582 | 0.9591 | 0.9599 | 0.9608 | 0.9616 | 0.9625 | 0.9633
1.8 | 0.9641 | 0.9656 | 0.9664 | 0.9671 | 0.9678 | 0.9686 | 0.9693 | 0.9700 | 0.9706
1.9 | 0.9713 | 0.9726 | 0.9732 | 0.9738 | 0.9744 | 0.9750 | 0.9756 | 0.9762 | 0.9767
2.0 | 0.9772 | 0.9783 | 0.9788 | 0.9793 | 0.9798 | 0.9803 | 0.9808 | 0.9812 | 0.9817
2.1 | 0.9821 | 0.9830 | 0.9834 | 0.9838 | 0.9842 | 0.9846 | 0.9850 | 0.9854 | 0.9857
2.2 | 0.9861 | 0.9868 | 0.9871 | 0.9874 | 0.9878 | 0.9881 | 0.9884 | 0.9887 | 0.9890
2.3 | 0.9893 | 0.9898 | 0.9901 | 0.9904 | 0.9906 | 0.9909 | 0.9911 | 0.9913 | 0.9916
2.4 | 0.9918 | 0.9922 | 0.9925 | 0.9927 | 0.9929 | 0.9931 | 0.9932 | 0.9934 | 0.9936
2.5 | 0.9938 | 0.9941 | 0.9943 | 0.9945 | 0.9946 | 0.9948 | 0.9949 | 0.9951 | 0.9952
2.6 | 0.9953 | 0.9956 | 0.9957 | 0.9959 | 0.9960 | 0.9961 | 0.9962 | 0.9963 | 0.9964
2.7 | 0.9965 | 0.9967 | 0.9968 | 0.9969 | 0.9970 | 0.9971 | 0.9972 | 0.9973 | 0.9974
2.8 | 0.9974 | 0.9976 | 0.9977 | 0.9977 | 0.9978 | 0.9979 | 0.9979 | 0.9980 | 0.9981
2.9 | 0.9981 | 0.9982 | 0.9983 | 0.9984 | 0.9984 | 0.9985 | 0.9985 | 0.9986 | 0.9986
3.0 | 0.9987 | 0.9993 | 0.9995 | 0.9997 | 0.9998 | 0.9998 | 0.9999 | 0.9999 | 1.0000

Table 2. Percentiles of the t distribution

d.f. | $t_{.60}$ | $t_{.70}$ | $t_{.80}$ | $t_{.90}$ | $t_{.95}$ | $t_{.975}$ | $t_{.99}$ | $t_{.995}$
1    | .325 | .727 | 1.376 | 3.078 | 6.314 | 12.706 | 31.821 | 63.657
2    | .289 | .617 | 1.061 | 1.886 | 2.920 | 4.303  | 6.965  | 9.925
3    | .277 | .584 | .978  | 1.638 | 2.353 | 3.182  | 4.541  | 5.841
4    | .271 | .569 | .941  | 1.533 | 2.132 | 2.776  | 3.747  | 4.604
5    | .267 | .559 | .920  | 1.476 | 2.015 | 2.571  | 3.365  | 4.032
6    | .265 | .553 | .906  | 1.440 | 1.943 | 2.447  | 3.143  | 3.707
7    | .263 | .549 | .896  | 1.415 | 1.895 | 2.365  | 2.998  | 3.499
8    | .262 | .546 | .889  | 1.397 | 1.860 | 2.306  | 2.896  | 3.355
9    | .261 | .543 | .883  | 1.383 | 1.833 | 2.262  | 2.821  | 3.250
10   | .260 | .542 | .879  | 1.372 | 1.812 | 2.228  | 2.764  | 3.169
11   | .260 | .540 | .876  | 1.363 | 1.796 | 2.201  | 2.718  | 3.106
12   | .259 | .539 | .873  | 1.356 | 1.782 | 2.179  | 2.681  | 3.055
13   | .259 | .538 | .870  | 1.350 | 1.771 | 2.160  | 2.650  | 3.012
14   | .258 | .537 | .868  | 1.345 | 1.761 | 2.145  | 2.624  | 2.977
15   | .258 | .536 | .866  | 1.341 | 1.753 | 2.131  | 2.602  | 2.947
16   | .258 | .535 | .865  | 1.337 | 1.746 | 2.120  | 2.583  | 2.921
17   | .257 | .534 | .863  | 1.333 | 1.740 | 2.110  | 2.567  | 2.898
18   | .257 | .534 | .862  | 1.330 | 1.734 | 2.101  | 2.552  | 2.878
19   | .257 | .533 | .861  | 1.328 | 1.729 | 2.093  | 2.539  | 2.861
20   | .257 | .533 | .860  | 1.325 | 1.725 | 2.086  | 2.528  | 2.845
21   | .257 | .532 | .859  | 1.323 | 1.721 | 2.080  | 2.518  | 2.831
22   | .256 | .532 | .858  | 1.321 | 1.717 | 2.074  | 2.508  | 2.819
23   | .256 | .532 | .858  | 1.319 | 1.714 | 2.069  | 2.500  | 2.807
24   | .256 | .531 | .857  | 1.318 | 1.711 | 2.064  | 2.492  | 2.797
25   | .256 | .531 | .856  | 1.316 | 1.708 | 2.060  | 2.485  | 2.787
26   | .256 | .531 | .856  | 1.315 | 1.706 | 2.056  | 2.479  | 2.779
27   | .256 | .531 | .855  | 1.314 | 1.703 | 2.052  | 2.473  | 2.771
28   | .256 | .530 | .855  | 1.313 | 1.701 | 2.048  | 2.467  | 2.763
29   | .256 | .530 | .854  | 1.311 | 1.699 | 2.045  | 2.462  | 2.756
30   | .256 | .530 | .854  | 1.310 | 1.697 | 2.042  | 2.457  | 2.750
40   | .255 | .529 | .851  | 1.303 | 1.684 | 2.021  | 2.423  | 2.704
60   | .254 | .527 | .848  | 1.296 | 1.671 | 2.000  | 2.390  | 2.660
120  | .254 | .526 | .845  | 1.289 | 1.658 | 1.980  | 2.358  | 2.617
∞    | .253 | .524 | .842  | 1.282 | 1.645 | 1.960  | 2.326  | 2.576
Table 3. Percentiles of the $\chi^2$ distribution 


df | $\chi^2_{.005}$ | $\chi^2_{.01}$ | $\chi^2_{.025}$ | $\chi^2_{.05}$ | $\chi^2_{.10}$ | $\chi^2_{.90}$ | $\chi^2_{.95}$ | $\chi^2_{.975}$ | $\chi^2_{.99}$ | $\chi^2_{.995}$ 
1 | .000039 | .00016 | .00098 | .0039 | .0158 | 2.71 | 3.84 | 5.02 | 6.63 | 7.88 
2 | .0100 | .0201 | .0506 | .1026 | .2107 | 4.61 | 5.99 | 7.38 | 9.21 | 10.60 
3 | .0717 | .115 | .216 | .352 | .584 | 6.25 | 7.81 | 9.35 | 11.34 | 12.84 
4 | .207 | .297 | .484 | .711 | 1.064 | 7.78 | 9.49 | 11.14 | 13.28 | 14.86 
5 | .412 | .554 | .831 | 1.15 | 1.61 | 9.24 | 11.07 | 12.83 | 15.09 | 16.75 
6 | .676 | .872 | 1.24 | 1.64 | 2.20 | 10.64 | 12.59 | 14.45 | 16.81 | 18.55 
7 | .989 | 1.24 | 1.69 | 2.17 | 2.83 | 12.02 | 14.07 | 16.01 | 18.48 | 20.28 
8 | 1.34 | 1.65 | 2.18 | 2.73 | 3.49 | 13.36 | 15.51 | 17.53 | 20.09 | 21.96 
9 | 1.73 | 2.09 | 2.70 | 3.33 | 4.17 | 14.68 | 16.92 | 19.02 | 21.67 | 23.59 
10 | 2.16 | 2.56 | 3.25 | 3.94 | 4.87 | 15.99 | 18.31 | 20.48 | 23.21 | 25.19 
11 | 2.60 | 3.05 | 3.82 | 4.57 | 5.58 | 17.28 | 19.68 | 21.92 | 24.73 | 26.76 
12 | 3.07 | 3.57 | 4.40 | 5.23 | 6.30 | 18.55 | 21.03 | 23.34 | 26.22 | 28.30 
13 | 3.57 | 4.11 | 5.01 | 5.89 | 7.04 | 19.81 | 22.36 | 24.74 | 27.69 | 29.82 
14 | 4.07 | 4.66 | 5.63 | 6.57 | 7.79 | 21.06 | 23.68 | 26.12 | 29.14 | 31.32 
15 | 4.60 | 5.23 | 6.26 | 7.26 | 8.55 | 22.31 | 25.00 | 27.49 | 30.58 | 32.80 
16 | 5.14 | 5.81 | 6.91 | 7.96 | 9.31 | 23.54 | 26.30 | 28.85 | 32.00 | 34.27 
18 | 6.26 | 7.01 | 8.23 | 9.39 | 10.86 | 25.99 | 28.87 | 31.53 | 34.81 | 37.16 
20 | 7.43 | 8.26 | 9.59 | 10.85 | 12.44 | 28.41 | 31.41 | 34.17 | 37.57 | 40.00 
24 | 9.89 | 10.86 | 12.40 | 13.85 | 15.66 | 33.20 | 36.42 | 39.36 | 42.98 | 45.56 
30 | 13.79 | 14.95 | 16.79 | 18.49 | 20.60 | 40.26 | 43.77 | 46.98 | 50.89 | 53.67 
40 | 20.71 | 22.16 | 24.43 | 26.51 | 29.05 | 51.81 | 55.76 | 59.34 | 63.69 | 66.77 
60 | 35.53 | 37.48 | 40.48 | 43.19 | 46.46 | 74.40 | 79.08 | 83.30 | 88.38 | 91.95 
120 | 83.85 | 86.92 | 91.58 | 95.70 | 100.62 | 140.23 | 146.57 | 152.21 | 158.95 | 163.64 


For large degrees of freedom, 

$\chi^2_p = \frac{1}{2}\left(z_p + \sqrt{2\nu - 1}\right)^2$ approximately, 

where $\nu$ = degrees of freedom and $z_p$ is given by Table 1. 
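
The large-degrees-of-freedom approximation can be tried numerically. The sketch below assumes the formula has the standard form $\frac{1}{2}(z_p + \sqrt{2\nu - 1})^2$ and takes $z_p$ from Python's `statistics.NormalDist` rather than from Table 1; the function name is illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def chi2_percentile_approx(p, nu):
    """Approximate the p-th percentile of chi-square with nu degrees of freedom."""
    z_p = NormalDist().inv_cdf(p)  # standard normal percentile z_p
    return 0.5 * (z_p + sqrt(2 * nu - 1)) ** 2


# Compare with the last row of Table 3 (df = 120), which lists 146.57 for p = .95.
approx_95_120 = chi2_percentile_approx(0.95, 120)
print(round(approx_95_120, 2))
```

For 120 degrees of freedom the approximation lands within a few tenths of the tabulated percentile.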
Table 4. One-Sample Kolmogorov-Smirnov Test 


(If calculated $d_n$ is greater than value shown, then reject the null 
hypothesis at the chosen level of significance) 


Sample size n | Level of significance for $d_n$: .20 | .15 | .10 | .05 | .01 
1 | .900 | .925 | .950 | .975 | .995 
2 | .684 | .726 | .776 | .842 | .929 
3 | .565 | .597 | .642 | .708 | .828 
4 | .494 | .525 | .564 | .624 | .733 
5 | .446 | .474 | .510 | .565 | .669 
6 | .410 | .436 | .470 | .521 | .618 
7 | .381 | .405 | .438 | .486 | .577 
8 | .358 | .381 | .411 | .457 | .543 
9 | .339 | .360 | .388 | .432 | .514 
10 | .322 | .342 | .368 | .410 | .490 
11 | .307 | .326 | .352 | .391 | .468 
12 | .295 | .313 | .338 | .375 | .450 
13 | .284 | .302 | .325 | .361 | .433 
14 | .274 | .292 | .314 | .349 | .418 
15 | .266 | .283 | .304 | .338 | .404 
16 | .258 | .274 | .295 | .328 | .392 
17 | .250 | .266 | .286 | .318 | .381 
18 | .244 | .259 | .278 | .309 | .371 
19 | .237 | .252 | .272 | .301 | .363 
20 | .231 | .246 | .264 | .294 | .356 
25 | .210 | .220 | .240 | .270 | .320 
30 | .190 | .200 | .220 | .240 | .290 
35 | .180 | .190 | .210 | .230 | .270 
Over 35 | $1.07/\sqrt{n}$ | $1.14/\sqrt{n}$ | $1.22/\sqrt{n}$ | $1.36/\sqrt{n}$ | $1.63/\sqrt{n}$ 
Table 5. Critical Values for the Two-Sample Kolmogorov-Smirnov Statistic 


Sample size ny 
1/2)3)] 4 5 6 7 8 9 10 12 15 
1 
> 7/8 | 16/18 | 9/10 | 11/12 | 26/30 
* * * * * 
4 * 12/15 | 5/6 | 18/21 | 18/24) 7/9 | 24/30} 9/12 | 11/15 
* * * 8/9 | 27/30 | 11/12 | 13/15 
3/4 | 16/20 | 9/12 | 21/28 | 6/8 | 27/36 | 14/20 | 8/12 | 41/60 
4 
10/12 | 24/28 | 7/8 | 32/36 | 16/20 | 10/12 | 48/60 
é 4/5 | 20/30 | 25/35 | 27/40 | 31/45 | 7/10 | 40/60 | 10/15 
8 4/5 | 25/30 | 30/35 | 32/40 | 36/45 | 8/10 | 48/60 | 11/15 
2 ‘ 4/6 | 29/42 | 16/24 | 12/18 | 19/30 | 7/12 | 18/30 
3 5/6 | 35/42 | 18/24 | 14/18 | 22/30 | 9/12 | 22/30 
5/7 | 35/56 | 40/63 | 43/70 | 51/84 | 61/105 
4 
5/7 | 42/56 | 47/63 | 53/70 | 58/84 | 70/105 
: 5/8 | 45/72 | 23/40 | 14/24 | 66/120 
6/8 | 54/72 | 28/40 | 16/24 | 80/120 
; 5/9 | 52/90 | 20/36 | 24/45 
6/9 | 62/90 | 24/36 | 29/45 
6/10 | 32/60 | 15/30 
10 
7/10 | 39/60 | 19/30 
‘5 6/12 | 30/60 
T/N2 | 35/60 
TAS 
15 
8/15 


Notes: 1. Reject $H_0$ at the 5% or 1% level if $d = \sup |F_{n_1}(x) - F_{n_2}(x)|$ equals or exceeds the 
tabulated value. The upper value corresponds to $\alpha = .05$ and the lower to $\alpha = .01$. 
2. Where * appears, do not reject $H_0$ at the given level. 
3. For large values of $n_1$ and $n_2$, the following approximate formulas may be used: 


$\alpha = .05$: $1.36\sqrt{\dfrac{n_1 + n_2}{n_1 n_2}}$ 


$\alpha = .01$: $1.63\sqrt{\dfrac{n_1 + n_2}{n_1 n_2}}$ 
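
As an illustration of how the two-sample statistic and the large-sample rule fit together, here is a minimal sketch. It assumes the usual coefficients 1.36 and 1.63 for the 5% and 1% levels; the function names and the sample data are hypothetical.

```python
from bisect import bisect_right
from math import sqrt


def ks_two_sample(xs, ys):
    """d = sup |F_n1(x) - F_n2(x)| for two samples, using empirical d.f.'s."""
    xs, ys = sorted(xs), sorted(ys)
    # Both empirical d.f.'s are right-continuous step functions, so the
    # supremum is attained at one of the pooled sample points.
    return max(
        abs(bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys))
        for t in xs + ys
    )


def large_sample_critical_value(n1, n2, alpha=0.05):
    """Approximate critical value for large n1 and n2 (alpha = .05 or .01)."""
    coefficient = {0.05: 1.36, 0.01: 1.63}[alpha]
    return coefficient * sqrt((n1 + n2) / (n1 * n2))


d = ks_two_sample([1, 2, 3], [4, 5, 6])          # completely separated samples
critical = large_sample_critical_value(100, 100)  # about 0.19
```

The null hypothesis would be rejected when `d` equals or exceeds the critical value; for small samples the exact entries of Table 5 should be used instead of the approximation.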


Appendix II: Answers and Hints for Selected 
Odd-Numbered Exercises 


Exercise 1.1.1. 


(a) The sample points are HH, HT, TH, TT, and the elementary events are {HH}, 
{HT}, {TH}, {TT}. 


Exercise 1.1.3. 
(a) Two possible sample spaces to describe three tosses of a coin are: 


$S_1$ = {an even # of H’s, an odd # of H’s}, 
$S_2$ = {HHHH, HHHT, HHTH, HHTT, HTHH, HTHT, HTTH, HTTT, 
THHH, THHT, THTH, THTT, TTHH, TTHT, TTTH, TTTT}, 


where the fourth letter is to be ignored in each sample point. 


(c) It is not possible to find an event corresponding to the statement p = “at most 
one tail is obtained in three tosses” in every conceivable sample space for the tossing 
of three coins, because some sample spaces are too coarse, that is, the sample points 
that contain this outcome also contain opposite outcomes. For instance, in $S_1$ above, 
the sample point “an even # of H’s” contains the outcomes HHT, HTH, THH for 
which our p is true and it also contains the outcome TTT, for which it is not true. 
Thus, p has no truth-set in $S_1$. 


Exercise 1.1.5. 


In the 52-element sample space for the drawing of a card 


(a) the event corresponding to the statement p = “An Ace or a red King is drawn” is 
P = {AS, AH, AD, AC, KH, KD}. 


(b) a statement corresponding to the event U = {AH, KH, QH, JH} is u = “The Ace 
of hearts or a heart face card is drawn.” 
Exercise 1.1.7. 


One possible sample space is: S = {January, February, ... , December}. 


Exercise 1.2.1. 


(a) {1, 3, 5, 7, 9} or {k : k = 2n + 1, n = 0, 1, 2, 3, 4}. 


Exercise 1.2.5. 


$A \cap B \cap C = \{1\}$, $(A \cap B) \cap C = \{1, 4\} \cap \{1, 2, 3, 7\} = \{1\}$, etc. 


Exercise 1.2.7. 

(a) $A \cap (B \cup C) = \{1, 3, 4, 5\} \cap \{1, 2, 3, 4, 6, 7\} = \{1, 3, 4\}$, but $(A \cap B) \cup C = 
\{1, 4\} \cup \{1, 2, 3, 7\} = \{1, 2, 3, 4, 7\}$. 

Exercise 1.2.9. 

Draw a Venn diagram with nonoverlapping sets A and B and number the three re- 
gions. 

Exercise 1.2.11. 


1. First, assume that $A \cup B = B$, that is, that $\{x : x \in A \text{ or } x \in B\} = B$. Hence, if 
$x \in A$, then x must also belong to B, which means that $A \subset B$. 

Alternatively, by the definition of unions, $A \subset A \cup B$, and so, if $A \cup B = B$, then 
substituting B for $A \cup B$ in the previous relation, we obtain that $A \cup B = B$ implies 
$A \subset B$. 


2. Conversely, assume that $A \subset B$, and proceed similarly as above. 


Exercise 1.3.1. 


(a) The event R corresponding to r = “b is 4 or 5” is the region consisting of the 
fourth and fifth columns in Figure 1.4, that is, $R = \{(b, w) : b = 4, 5 \text{ and } w = 
1, 2, \ldots, 6\}$. 


Exercise 1.3.3. 


Ps = {2,3,4} = ABC U ABC U ABC. 


Exercise 2.1.1. 


Let A = set of drinkers, and B = set of smokers. Then n(AB) = 23. 
Exercise 2.1.5. 


$n(A) + n(B) + n(C) - n(A \cap B) - n(A \cap C) - n(B \cap C) + n(A \cap B \cap C) = 
n(1, 3, 4, 5) + n(1, 2, 4, 6) + n(1, 2, 3, 7) - n(1, 4) - n(1, 3) - n(1, 2) + n(1)$, etc. 
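
The count can be checked directly with Python sets. This is only a sketch: the three sets are read off from the element lists in the answers to Exercises 1.2.7 and 2.1.5.

```python
# Sets as they appear in the answers to Exercises 1.2.7 and 2.1.5.
A = {1, 3, 4, 5}
B = {1, 2, 4, 6}
C = {1, 2, 3, 7}

# Left side: size of the union counted directly.
union_size = len(A | B | C)

# Right side: inclusion-exclusion with all pairwise and the triple intersection.
inclusion_exclusion = (
    len(A) + len(B) + len(C)
    - len(A & B) - len(A & C) - len(B & C)
    + len(A & B & C)
)

# The two non-equal expressions from Exercise 1.2.7 (a).
left = A & (B | C)
right = (A & B) | C
```

Both counts agree, while `left` and `right` differ, which is the point of Exercise 1.2.7 (a).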


Exercise 2.2.1. 


(a) S = {ASAH, ASAD, ASAC, AHAS, AHAD, AHAC, ADAS, ADAH, ADAC, ACAS, 
ACAH, ACAD}, 


Exercise 2.2.5. 


(a) 24360, (b) 27000. 


Exercise 2.2.7. 


(a) 14, (b) 30. 


Exercise 2.3.1. 


20, 120,8,1, 1. 


Exercise 2.3.5. 


{ABC ACB BAC BCA CAB CBA}, etc. 


Exercise 2.3.7. 


360. 


Exercise 2.3.9. 


(a) 24, (b) 12. 


Exercise 2.3.11. 


(a) 1666, (b) 1249, (c) 416, (d) 2500. 


Exercise 2.4.1. 


The tenth row is 
1 10 45 120 210 252 210 120 45 10 1 
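
A row of Pascal's triangle can be generated from binomial coefficients; the short sketch below uses `math.comb` and is meant only as a check of the row above.

```python
from math import comb

# Row n of Pascal's triangle lists C(n, 0), C(n, 1), ..., C(n, n).
n = 10
row = [comb(n, k) for k in range(n + 1)]
print(row)
```

The entries are symmetric and add up to $2^{10}$.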
Exercise 2.4.5. 


45. 


Exercise 2.4.7. 


(a) $5^n$. 


Exercise 2.4.9. 


(a) $2^n - 1 - n$. 


Exercise 2.5.1. 


(a) 64, (b) 15, (c) 60, (d) 240. 


Exercise 2.5.3. 


(a) 420, (b) 60, (c) 300, (d) 240. 


Exercise 2.5.5. 


(a) 210, (b) —22,680. 


Exercise 2.5.7. 


(a) 66, (b) 36. 


Exercise 3.1.1. 


(g) $P(B \cap C) = 0$, (h) $P(B \cup C) = 48/52$. 


Exercise 3.1.3. 


$A = AB \cup A\bar{B}$ and $AB \cap A\bar{B} = \emptyset$. Thus, by Axiom 3, $P(A) = P(AB) + P(A\bar{B})$. 
Similarly, $P(B) = P(AB) + P(\bar{A}B)$. Add, use Axiom 3 again, and rearrange. 


Exercise 3.1.5. 


The given relation is true if and only if P(AB) = 0. 
Exercise 3.1.7. 


(a) This result follows at once from Theorem 3.1.2 because we are subtracting the (by 
Axiom 1) nonnegative quantity P(AB) from P(A)+ P(B) on the right of Equation 
3.1 to get P(A U B). 


(c) Use induction. 


Exercise 3.2.1. 

(b) P(A and K) = 8/12 = 2/3. 

(d) Here, each unordered pair corresponds to two ordered pairs and therefore each 
one has probability 2 - 4 = | In Example 3.2.2, some unordered pairs correspond 
to two ordered pairs and some to one. 


Exercise 3.2.3. 


We did not get P(at least one six) = 1, in spite of the fact that on each throw the 
probability of getting a six is 1/6, and 6 times 1/6 is 1, for two reasons: First, we 
would be justified in taking the 1/6 six times here only if the events of getting a 
six on the different throws were mutually exclusive; then the probability of getting a 
six on one of the throws could be computed by Axiom 3 as 6 - (1/6), but these are 
not mutually exclusive events. Second, the event of getting at least one six is not the 
same as the event of getting a six on the first throw, or on the second, or etc. 
Exercise 3.2.5. 


5/9. 


Exercise 3.2.7. 


$m!\,n!/(m + n - 1)!$. 


Exercise 3.2.9. 


P(jackpot) = 1/5,245,786 ≈ $2 \cdot 10^{-7}$ and P(match 5) = 108/2,622,893 ≈ $4 \cdot 10^{-5}$. 
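
These fractions can be reproduced with binomial coefficients. The sketch below assumes a lottery in which 6 numbers are drawn from 42, which is inferred from the denominator 5,245,786 and is not stated in this appendix.

```python
from fractions import Fraction
from math import comb

# Assumed game: choose 6 numbers out of 42 (inferred from C(42, 6) = 5,245,786).
tickets = comb(42, 6)

p_jackpot = Fraction(1, tickets)
# Match exactly 5: 5 of the 6 winning numbers and 1 of the 36 others.
p_match5 = Fraction(comb(6, 5) * comb(36, 1), tickets)

print(float(p_jackpot), float(p_match5))
```

Using `Fraction` keeps the results exact, so they can be compared with the reduced fractions printed above.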


Exercise 3.2.11. 
(a) 3/8, (b) 0.441, (c) 0.189. 


Exercise 3.2.13. 


P(all different) ≈ 0.507. (Note that we have included “straights” and “flushes” in 
the count, that is, cards with five consecutive denominations or five cards of the same 
suit, which are very valuable hands, while the other cases of different denominations 
are poor hands.) 
Exercise 3.2.15. 


P(full house in poker) ≈ 0.0014. 


Exercise 3.2.17. 


P(full house in poker dice) ≈ 0.0386. 
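
The three counting answers for Exercises 3.2.13 to 3.2.17 can be recomputed as follows. This is a sketch using the standard counts for a 52-card deck and five dice; the variable names are illustrative.

```python
from math import comb

hands = comb(52, 5)  # number of five-card poker hands

# Five different denominations, any suit for each card.
p_all_different = comb(13, 5) * 4 ** 5 / hands

# Full house: denomination and suits of the triple, then of the pair.
p_full_house = 13 * comb(4, 3) * 12 * comb(4, 2) / hands

# Poker dice (five dice): value of the triple, value of the pair,
# and which three of the five dice show the triple.
p_dice_full_house = 6 * 5 * comb(5, 3) / 6 ** 5
```

Rounded to the precision used in the text, these give 0.507, 0.0014 and 0.0386.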


Exercise 3.2.19. 


If $0 \le n_1 \le n$, $n_1 \le N_1$ and $n - n_1 \le N_2$, then the last inequality is equivalent 
to $n - N_2 \le n_1$, which together with $0 \le n_1$ means that $n_1$ is greater than or equal 
to both 0 and $n - N_2$, and so $\max(0, n - N_2) \le n_1$. The middle two inequalities 
say that $n_1$ is less than or equal to both n and $N_1$, and so $n_1 \le \min(n, N_1)$. Thus, 
$0 \le n_1 \le n$, $n_1 \le N_1$ and $n - n_1 \le N_2$ imply $\max(0, n - N_2) \le n_1 \le \min(n, N_1)$. 

Conversely, if $\max(0, n - N_2) \le n_1 \le \min(n, N_1)$, then the first part implies 
that $0 \le n_1$ and $n - N_2 \le n_1$, or $n - n_1 \le N_2$, and the second part implies that 
$n_1 \le n$ and $n_1 \le N_1$. Thus, $\max(0, n - N_2) \le n_1 \le \min(n, N_1)$ implies $0 \le n_1 \le n$, 
$n_1 \le N_1$ and $n - n_1 \le N_2$. 


Exercise 3.3.1. 


Let E = “even” and O = “odd” and consider the sample space S = {EEE, EEO, 
EOE, EOO, OEE, OEO, OOE, OOO} for throwing three dice. Compute P(A), P(B) 
and P(AB). 


Exercise 3.3.5. 


(a) Let A and B be independent. Then $P(A\bar{B}) = P(A) - P(AB) = P(A) - 
P(A)P(B) = P(A)[1 - P(B)] = P(A)P(\bar{B})$. 


Exercise 3.3.7. 
POS C2 pUy=5 (12 pO)= 10+ (12) ete: 
Exercise 3.3.9. 


P(two of each color) ≈ 0.123. 


Exercise 3.3.11. 


Expand both sides of P(A(B UC)) = P(AB U AC). 
Exercise 3.4.3. 


If A = {K or 2} and B = {J, Q, K}, then P(A|B) = 1/3. 


Exercise 3.4.5. 


Apply Theorem 3.4.1, Part 3. 


Exercise 3.4.7. 


P(two girls and one boy | one child is a girl) = 1/2. 


Exercise 3.4.9. 


P(two Kings | two face cards) = 1/11. 


Exercise 3.4.11. 


P(exactly one King | at most one King) = 8/55. 


Exercise 3.5.1. 


(c) 17/40. 


Exercise 3.5.3. 


(a) 1/33, (b) 5/101, (c) 1/17, (d) 1/2. 


Exercise 3.5.5. 


Equation 3.52 becomes $P(A_m) = P(A_{m+1}) \cdot p + P(A_{m-1}) \cdot q$ for $0 < m < n$, 
where $q = 1 - p$ and $A_m$ denotes the event that the gambler with initial capital m is 
ruined. Try to find constants $\lambda$ such that $P(A_m) = \lambda^m$ for $0 \le m \le n$, just as in the 
analogous, but more familiar, case of linear homogeneous differential equations with 
constant coefficients. Solve the resulting quadratic equation $p\lambda^2 - \lambda + q = 0$. The 
general solution of the difference equation is then of the form $P(A_m) = a\lambda_1^m + b\lambda_2^m = 
a + b(q/p)^m$. As in Example 3.5.5, use the boundary conditions $P(A_0) = 1$ and 
$P(A_n) = 0$ to determine the constants a and b. Thus, the probability of the gambler’s 
ruin is $P(A_m) = [(q/p)^m - (q/p)^n]/[1 - (q/p)^n]$, if he starts with m dollars and 
stops if he reaches n dollars. If $q < p$, that is, the game is favorable for our gambler, 
then $\lim_{n \to \infty}(q/p)^n = 0$, and so the gambler may play forever without getting 
ruined and the probability that he does not get ruined is $1 - (q/p)^m$. 
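
The closed form can be checked against the difference equation and the boundary conditions. The sketch below assumes $p \ne q$ (the unfair game treated in this hint); the function name and the numerical values of p and n are hypothetical.

```python
def ruin_probability(m, n, p):
    """P(A_m): ruin probability with capital m, target n, win probability p != 1/2."""
    r = (1.0 - p) / p  # q/p, the nontrivial root of p*lam**2 - lam + q = 0
    return (r ** m - r ** n) / (1.0 - r ** n)


# Hypothetical favorable game: p = 0.6, stop at n = 10 dollars.
p, n = 0.6, 10
probs = [ruin_probability(m, n, p) for m in range(n + 1)]

# Largest violation of P(A_m) = p P(A_{m+1}) + q P(A_{m-1}) for 0 < m < n.
max_residual = max(
    abs(probs[m] - (p * probs[m + 1] + (1.0 - p) * probs[m - 1]))
    for m in range(1, n)
)
```

For a favorable game and a large target n, `ruin_probability(m, n, p)` approaches $(q/p)^m$, in line with the limit discussed above.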
Exercise 3.5.7. 


5/17. 
Exercise 3.5.9. 


$P(GG|G) = \dfrac{P(G|GG)P(GG)}{P(G|GG)P(GG) + P(G|BG)P(BG) + P(G|GB)P(GB) + P(G|BB)P(BB)} = \dfrac{1}{2}$. 
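
The value of this Bayes ratio can be reproduced with exact fractions. The numbers below are assumptions chosen to match the result 1/2: the four family types are taken as equally likely, and G is read as "a randomly selected child is a girl", so that P(G|BG) = P(G|GB) = 1/2.

```python
from fractions import Fraction

# Assumed priors: each two-child composition has probability 1/4.
prior = {"GG": Fraction(1, 4), "BG": Fraction(1, 4),
         "GB": Fraction(1, 4), "BB": Fraction(1, 4)}

# Assumed likelihoods P(G | composition) for a randomly selected child.
likelihood = {"GG": Fraction(1), "BG": Fraction(1, 2),
              "GB": Fraction(1, 2), "BB": Fraction(0)}

# Denominator of Bayes' theorem: total probability of G.
evidence = sum(likelihood[k] * prior[k] for k in prior)
posterior_gg = likelihood["GG"] * prior["GG"] / evidence
```

Under a different reading of G (for example, "at least one child is a girl") the likelihoods change and so does the posterior, so the assumptions matter here.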


Exercise 3.5.11. 
P(WB|BW U WB) = 4/49. 


Exercise 3.5.13. 


Let A = “The witness says the hit-and-run taxi was blue,” $B_1$ = “The hit-and-run 
taxi was blue,” and $B_2$ = “The hit-and-run taxi was black.” Then $P(B_1|A) \approx 0.41$. 
Thus, the evidence against the blue taxi company is very weak. 


Exercise 4.1.1. 


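The p.f. of X is given by $f(x) = \binom{13}{x}\binom{39}{5-x}/\binom{52}{5}$, for x = 0, 1, ..., 5, and the d.f. of 
X is given by 

\[ F(x) \approx \begin{cases} 0 & \text{if } x < 0 \\ .222 & \text{if } 0 \le x < 1 \\ .633 & \text{if } 1 \le x < 2 \\ .907 & \text{if } 2 \le x < 3 \\ .989 & \text{if } 3 \le x < 4 \\ .999 & \text{if } 4 \le x < 5 \\ 1 & \text{if } 5 \le x. \end{cases} \] 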
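
A hypergeometric p.f. and its d.f. can be tabulated in a few lines. The parameters below (13 cards of one kind among 52, five cards drawn) are an assumption inferred from the tabulated values .633, .907 and .989, not something stated in this appendix.

```python
from math import comb, floor

# Assumed setting: X = number of cards of one suit in a five-card hand.
def pf(x):
    """Hypergeometric p.f.: C(13, x) C(39, 5 - x) / C(52, 5) for x = 0, ..., 5."""
    return comb(13, x) * comb(39, 5 - x) / comb(52, 5)


def df(x):
    """Distribution function F(x) = P(X <= x), a step function."""
    if x < 0:
        return 0.0
    return sum(pf(k) for k in range(min(floor(x), 5) + 1))


for x in range(6):
    print(x, round(pf(x), 4), round(df(x), 4))
```

Because F is a step function, `df(2.5)` returns the same value as `df(2)`.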


Exercise 4.1.3. 
The possible values of x are 0, ±2, ±4, and f(0) = 6/16, f(±2) = 4/16, and 
f(±4) = 1/16. The histogram is shown in Figure II.1. 


Exercise 4.1.5. 


The possible values of x are 3,4,5, and f(3) = 5/8, f(4) = 5/16, and f(5) = 
1/16. 
Fig. II.1. Histogram of f(x) for Exercise 4.1.3. 


Exercise 4.1.7. 
The p.f. is given by $f(2) = P(X = 2) = p^2 + q^2$, $f(3) = pq^2 + qp^2 = pq(q + p) = 
pq$, $f(4) = pq(p^2 + q^2)$, $f(5) = p^2q^3 + q^2p^3 = (pq)^2(q + p) = (pq)^2$. Thus, in 
general, $f(2n) = (pq)^{n-1}(p^2 + q^2)$ and $f(2n + 1) = (pq)^n$ for n = 1, 2, 3, .... 


Exercise 4.1.9. 


The d.f. is 

\[ F(x) = \begin{cases} 0 & \text{if } x < 1 \\ \lfloor x \rfloor / 6 & \text{if } 1 \le x < 6 \\ 1 & \text{if } x \ge 6. \end{cases} \] 


Exercise 4.1.11. 


First, we display the possible values of X in a table as a function of the outcomes on 
the two dice: 


w\b | 1 | 2 | 3 | 4 | 5 | 6 
1 | 0 | 1 | 2 | 3 | 4 | 5 
2 | 1 | 0 | 1 | 2 | 3 | 4 
3 | 2 | 1 | 0 | 1 | 2 | 3 
4 | 3 | 2 | 1 | 0 | 1 | 2 
5 | 4 | 3 | 2 | 1 | 0 | 1 
6 | 5 | 4 | 3 | 2 | 1 | 0 
Since each box has probability 1/36, from here we can read off the values of the 
p.f. as 

\[ f(x) = \begin{cases} 6/36 & \text{if } x = 0 \\ 10/36 & \text{if } x = 1 \\ 8/36 & \text{if } x = 2 \\ 6/36 & \text{if } x = 3 \\ 4/36 & \text{if } x = 4 \\ 2/36 & \text{if } x = 5. \end{cases} \] 
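
The same p.f. can be obtained by enumerating the 36 equally likely outcomes. The sketch below assumes, as the table suggests, that X is the absolute difference of the numbers b and w shown by the two dice.

```python
from collections import Counter
from fractions import Fraction

# Count how often each value of |b - w| occurs among the 36 outcomes.
counts = Counter(abs(b - w) for b in range(1, 7) for w in range(1, 7))

# Each outcome has probability 1/36.
pf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

for x, prob in pf.items():
    print(x, f"{counts[x]}/36", prob)
```

`Fraction` reduces the values (for example 6/36 prints as 1/6), so the raw counts are printed alongside.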


Exercise 4.1.13. 


Since $A_1, A_2, \ldots$ is a nondecreasing sequence of events, $A = A_1 \cup [\bigcup_{k=2}^{\infty} (A_k - 
A_{k-1})]$ and the terms of the union are disjoint, Axiom 2 gives $P(A) = P(A_1) + 
\sum_{k=2}^{\infty} P(A_k - A_{k-1})$. By the definition of infinite sums, the expression on the right is 
the limit of the partial sums, that is, $P(A) = \lim_{n \to \infty}[P(A_1) + \sum_{k=2}^{n} P(A_k - A_{k-1})]$. 
Apply Axiom 2 again. 


Exercise 4.1.15. 


Let $(x_n)$ be a sequence of real numbers decreasing to $-\infty$, and let $A_n = \{s : X(s) \le 
x_n\}$ for every n. Then $F(x_n) = P(A_n)$ and $A_n \supset A_{n+1}$ for n = 1, 2, .... Furthermore, 
$A = \bigcap_{k=1}^{\infty} A_k = \emptyset$, because there is no $s \in S$ for which the real number X(s) 
can be $\le x_n$ for every n, considering that $x_n \to -\infty$. Apply the result of Exercise 
4.1.14 and the theorem from real analysis quoted in the hint. 


Exercise 4.2.1. 


1. C = 1/8, 4. P(X < 1) = 1/16, 5. P(2 < X) = 3/4. 


Exercise 4.2.3. 


LC=i, 4P8 eHSie. PC 2ixpsiz. 


Exercise 4.2.5. 


1. Roll a die. If the number six comes up, then also spin a needle that can point with 
uniform probability density to any point on a scale from 0 to 1 and let X be the 
number the needle points to. If the die shows 1, then let X = 1, and if the die shows 
any number other than 1 or 6, then let X = 2. 


2. P(X < 1/2) = 1/12, 3. P(X < 3/2) = 1/3, 4. P(1/2 < X < 2) = 1/4, 
5. P(X = 1) = 1/6, 6. P(X > 1) = 2/3, 7. P(X = 2) = 2/3. 
Exercise 4.2.7. 
2. P(X < 1/2) = 1/10. 3. P(X < 3/2) = 3/5. 4. P(1/2 < X < 2) = 7/10. 
5. P(X = 1) = 1/5. 6. P(X > 1) = 3/5. 7. P(X = 2) = 1/5. 
Exercise 4.3.1. 


First, make a table whose first row contains the possible values of x, the second row 
the corresponding values of f(x), and the third row the values of y = ee 
From this table extract a new table for the p-f. of Y. 


Exercise 4.3.3. 
Pei 1/2 ify =0 
Be =) 12: Ay =e 
Exercise 4.3.5. 
e ify <0 
F = . 
v0) f if y > 0. 
Exercise 4.3.7. 
Aes ' ify <0 
V(Y) = 4 py . 
ae fx(x)dx ify >0. 
Exercise 4.3.9. 


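\[ F_Y(y) = P(Y \le y) = \begin{cases} 0 & \text{if } y < -r \\ \dfrac{1}{2} + \dfrac{1}{\pi}\arcsin\dfrac{y}{r} & \text{if } -r \le y < r \\ 1 & \text{if } r \le y. \end{cases} \] 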


Exercise 4.4.1. 


First, make a 6 x 6 table with the possible values of X and Y on the margins and the 
corresponding values of U = X + Y and V = X — Y in the body of the table. Next, 
make an 11 x 11 table with the possible values of U and V on the margins and the 
corresponding values of $f_{U,V}(u, v)$ in the body of the table, which are obtained from 
the first table, considering that each box there has probability 1/36. 


Exercise 4.4.3. 


1. 5/324, 2. 5/3888. 
Exercise 4.4.5. 
0 ifz<0O 
Fr(z)= 427 if0<z<1 
1 ifl <z. 


Exercise 4.4.7. 


1. C = 10. 
2. $f_X(x) = 10(x - x^4)/3$ if 0 < x < 1, and $f_X(x) = 0$ otherwise. Similarly, 
$f_Y(y) = 5y^4$ for 0 < y < 1, and $f_Y(y) = 0$ otherwise. 


3. If $(x, y) \in D$, then $F(x, y) = (5/3)y^3x^2 - (2/3)x^5$. If 0 < x < 1 and y > 1, 
then $F(x, y) = F(x, 1) = (5/3)x^2 - (2/3)x^5$. If 0 < y < 1 and x > y, then 
$F(x, y) = F(y, y) = y^5$. If x > 1 and y > 1, then F(x, y) = 1, and F(x, y) = 0 
otherwise. 


4. $P(X > Y^2) = 2/7$. 


Exercise 4.4.9. 


In the xy-plane, draw the four points $(x_1, y_1)$, $(x_1, y_2)$, $(x_2, y_1)$, $(x_2, y_2)$ and the 
quarter planes to the left and below each of these points. Number the regions. The 
probabilities of the quarter planes are the values of F at the given points. Use this 
fact and the additivity axiom for the numbered regions to prove the formula. 


Exercise 4.5.1. 


Compute some joint and marginal probabilities and test whether the product rule for 
independence holds. 


For instance, f(0, 1) = P(X =0,Y =1)= Weis =. 


1 


Exercise 4.5.3. 


Compare to Examples 4.4.5 and 4.5.2. 


Exercise 4.5.5. 


1. By the definition of indicators, $I_{AB}(s) = 1 \Leftrightarrow s \in AB$. By the definition of 
intersection, $s \in AB \Leftrightarrow (s \in A \text{ and } s \in B)$, and, by the definition of indicators, 
$(s \in A \text{ and } s \in B) \Leftrightarrow (I_A(s) = 1 \text{ and } I_B(s) = 1)$. Since $1 \cdot 1 = 1$ and $1 \cdot 0 = 0 \cdot 0 = 0$, 
clearly, $(I_A(s) = 1 \text{ and } I_B(s) = 1) \Leftrightarrow I_A(s)I_B(s) = 1$. Now, by the transitivity 
of equivalence relations, $I_{AB}(s) = 1 \Leftrightarrow I_A(s)I_B(s) = 1$, which is equivalent to 
$I_{AB} = I_A I_B$. 
Exercise 4.5.7. 


7/16. 


Exercise 4.5.9. 


$(2/3) - (\sqrt{3}/2\pi)$. 


Exercise 4.5.11. 


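1. $F_Z(z) = \int_0^\infty f_Y(y)\left[\int_0^{z/y} f_X(x)\,dx\right]dy$, and $f_Z(z) = \int_0^\infty f_Y(y) f_X(z/y)(1/y)\,dy$. 
2. $F_Z(z) = \int_0^\infty f_Y(y)\left[\int_0^{zy} f_X(x)\,dx\right]dy$, and $f_Z(z) = \int_0^\infty f_Y(y) f_X(zy)\,y\,dy$. 
3. $f_Z(z) = -\ln z$ if 0 < z < 1, and $f_Z(z) = 0$ otherwise. 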


Exercise 4.5.13. 


1. P(T > 200) ≈ 0.135. 
2. P(T < 400) ≈ 0.330. 

3. P(max T; < 200) + 2-107. 
4. $P(\min T_i < 40) \approx 0.999985$. 


Exercise 4.5.15. 


Use Definition 4.2.3 and Corollary 4.5.1. 


Exercise 4.6.1. 


The joint distribution of X and Y is trinomial (with the third possibility being that 
we get any number other than 1 or 6) and so, $f_{X,Y}(i, j) = P(X = i, Y = j) = 
\binom{4}{i,\,j,\,k}(1/6)^i (1/6)^j (4/6)^k$ for i, j, k = 0, 1, ..., 4, i + j + k = 4. Use this formula 
to compute the marginals and the conditional p.f., which is given by $f_{X|Y}(i|j) = 
f_{X,Y}(i, j)/f_Y(j)$. 
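
The marginals and the conditional p.f. can be computed mechanically from the trinomial formula. The sketch below assumes four rolls of a fair die with X and Y counting the 1's and the 6's respectively (which count is X and which is Y does not affect the numbers, by symmetry); all function names are illustrative.

```python
from fractions import Fraction
from math import factorial

N = 4                       # number of rolls
P_ONE = P_SIX = Fraction(1, 6)
P_OTHER = Fraction(4, 6)


def joint(i, j):
    """Trinomial p.f. f_{X,Y}(i, j) with k = N - i - j 'other' outcomes."""
    k = N - i - j
    if i < 0 or j < 0 or k < 0:
        return Fraction(0)
    coefficient = factorial(N) // (factorial(i) * factorial(j) * factorial(k))
    return coefficient * P_ONE ** i * P_SIX ** j * P_OTHER ** k


def marginal_y(j):
    """f_Y(j), obtained by summing the joint p.f. over i."""
    return sum(joint(i, j) for i in range(N + 1))


def conditional_x_given_y(i, j):
    """f_{X|Y}(i|j) = f_{X,Y}(i, j) / f_Y(j)."""
    return joint(i, j) / marginal_y(j)


total = sum(joint(i, j) for i in range(N + 1) for j in range(N + 1))
```

The marginal of Y comes out binomial with parameters 4 and 1/6, and the conditional p.f. of X given Y = j is binomial with parameters 4 − j and 1/5.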


Exercise 4.6.3. 


By Example 4.5.4, 


1 1 1 
P(A|X =x) = 


1 1 _ 1 
Pix-~<Y< =1l-x if-~-<x<1l, 
2 2 2 


and P(A) = 1/4, and, by Equation 4.141, $f_{X|A}(x) = [P(A|X = x) f_X(x)]/P(A)$. 
Substitute into the latter. 
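Exercise 4.6.5. 


First, compute the marginal densities. By definition, 

\[ f(x, y) = \begin{cases} 2 & \text{if } (x, y) \in D \\ 0 & \text{otherwise} \end{cases} \] 

and so 

\[ f_X(x) = \begin{cases} \int_0^{1-x} 2\,dy = 2(1 - x) & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \] 

and 

\[ f_Y(y) = \begin{cases} \int_0^{1-y} 2\,dx = 2(1 - y) & \text{if } 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases} \] 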

Now use Equation 4.128 to obtain the required conditional densities. 


Exercise 4.6.7. 


1. First, compute $P(Z \le z|X = x) = F_{Z|X}(z|x) = P(x + Y \le z) = P(Y \le 
z - x)$. Next, find $f_{Z|X}(z|x) = \partial F_{Z|X}(z|x)/\partial z$. Then use Equation 4.144 to obtain 
$f_{X|Z}(x|z)$. 


2. Draw a diagram to show that 


0 ifx <0 

1 1 >. 

1 

= ifx > 1 

2 

and so, that 
ifx <0 
P(X <x,Z <1) a 
Fy|a(x) = 12 = (l—-x)* if0<x <1 

1 ifx>1. 


Differentiate to obtain fx\4(x). 


Exercise 5.1.1. 


E(X) = 85/13 ≈ 6.54. 
Exercise 5.1.3. 


$E(T) = \int_0^\infty t \cdot \lambda^2 t e^{-\lambda t}\,dt$. Integrate by parts twice to obtain $E(T) = 2/\lambda$. 


Exercise 5.1.7. 


The distribution of a discrete X is symmetric about a number a if all possible values 
of X are paired so that for each $x_i < a$ there is a possible value $x_j > a$, and 
vice versa, such that $a - x_i = x_j - a$ and $f(x_i) = f(x_j)$. For such X, $E(X) = 
\sum_{\text{all } i} x_i f(x_i) = \sum_{x_i < a} x_i f(x_i) + a f(a) + \sum_{x_j > a} x_j f(x_j)$. (Here f(a) = 0, if 
a is not a possible value of X.) In the last term, apply the symmetry conditions and 
simplify. 

Exercise 5.1.9. 


Use Theorem 5.1.3 and the integral from the solution of Exercise 5.1.3. 


Exercise 5.1.11. 


Use Theorem 5.1.3. 


Exercise 5.1.13. 
Example 4.3.1 gives, for continuous X and Y = aX + b, where $a \ne 0$, $f_Y(y) = 
(1/|a|) f_X((y - b)/a)$. Use this expression to compute E(aX + b) = E(Y) and in 
the integral change the variable y to x = (y − b)/a separately when a < 0 and when 
a > 0. 


Exercise 5.1.15. 


In the expression for E (X) factor out np and then use the binomial theorem. 


Exercise 5.1.17. 


In this case, Theorem 5.1.6 does not apply, because X and Y are not independent. 
Nevertheless, Equation 5.1.52 is still true, and you have to check it directly. 


Exercise 5.1.19. 


E(Z) = 2/3. 


Exercise 5.1.21. 


Use the hint and the formula for the sum of a geometric series. 
Exercise 5.2.1. 


For instance, if X is a continuous r.v. with density 


Qi? ifl <x = co 


Io)=) 4 


otherwise 


and Y = —X, then show that X and Y are as required. 


Exercise 5.2.3. 


Prove Var(aX + bY +c) = a?Var(X) + b?Var(Y). 


Exercise 5.2.7. 

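1. $E(X + 2Y) = 3/\lambda$, and $\mathrm{Var}(X + 2Y) = 5/\lambda^2$. 

2. $E(X - 2Y) = -1/\lambda$, and $\mathrm{Var}(X - 2Y) = 5/\lambda^2$. 

3. $E(XY) = 1/\lambda^2$, and $\mathrm{Var}(XY) = 3/\lambda^4$. 

4. $E(X^2) = 2/\lambda^2$, and $\mathrm{Var}(X^2) = 20/\lambda^4$. 

5. $E((X + Y)^2) = 6/\lambda^2$, and $\mathrm{Var}((X + Y)^2) = 84/\lambda^4$. 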
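
These coefficients can be verified from the moments of the exponential distribution. The sketch assumes, as the answers suggest, that X and Y are independent exponential random variables with the same parameter $\lambda$, and uses the known fact $E(X^k) = k!/\lambda^k$ with $\lambda = 1$ (a general $\lambda$ only rescales a k-th order quantity by $\lambda^{-k}$).

```python
from math import comb, factorial


def moment(k):
    """E(X^k) = k! for an exponential random variable with lambda = 1."""
    return factorial(k)


def sum_moment(k):
    """E((X + Y)^k) via the binomial theorem and independence of X and Y."""
    return sum(comb(k, i) * moment(i) * moment(k - i) for i in range(k + 1))


var_x = moment(2) - moment(1) ** 2  # Var(X) = 1 when lambda = 1

results = {
    "E(X+2Y)": 3 * moment(1),
    "Var(X+2Y)": var_x + 4 * var_x,
    "E(XY)": moment(1) ** 2,
    "Var(XY)": moment(2) ** 2 - moment(1) ** 4,
    "E(X^2)": moment(2),
    "Var(X^2)": moment(4) - moment(2) ** 2,
    "E((X+Y)^2)": sum_moment(2),
    "Var((X+Y)^2)": sum_moment(4) - sum_moment(2) ** 2,
}
```

The integers in `results` are the numerators of the answers above; the mean and variance of X − 2Y follow in the same way with the sign of the mean reversed.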


Exercise 5.2.9. 


No: E(X) = E(Y) = np = n/2 and $E(XY) = (n^2 - n)/4$. 


Exercise 5.3.1. 
Write $\tilde{X} = X - \mu_X$ and $\tilde{Y} = Y - \mu_Y$. Then evaluate $m_3(X + Y) = E((\tilde{X} + \tilde{Y})^3)$ using 
the independence of X and Y, from which the independence of $\tilde{X}$ and $\tilde{Y}$ follows. 


Exercise 5.3.3. 


$\psi_Y(t) = \psi_X(at)e^{bt}$. 


Exercise 5.3.5. 


$\psi_{X-\mu}(t) = E(e^{t(X-\mu)})$ and so, by the result of Exercise 5.3.3, $\psi_{X-\mu}(t) = 
\psi_X(t)e^{-\mu t} = (pe^t + q)^n e^{-npt} = (pe^{qt} + qe^{-pt})^n$. Now, Var(X) is the second 
moment of $X - \mu$. Use the expression above and Theorem 5.3.1 to compute Var(X) 
as $\psi''_{X-\mu}(0)$. 
Exercise 5.3.9. 


Use the appropriate definitions and the binomial theorem. 


Exercise 5.3.11. 


Let $X_1, X_2, X_3$ denote the points showing on the three dice, respectively, and let 
$S = X_1 + X_2 + X_3$. Find $G_{X_i}(s)$ for any i, and then $G_S(s) = [G_{X_i}(s)]^3$. The coefficients 
of $s^k$ here are the required probabilities, and so, $p_3 = 1/216$, $p_4 = 3/216 = 1/72$, 
and $p_5 = 6/216 = 1/36$. 
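
Cubing the generating function of one die amounts to multiplying polynomials, which is easy to sketch. The helper `poly_mul` below is illustrative; coefficient lists are indexed by the power of s.

```python
from fractions import Fraction


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = power of s)."""
    product = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            product[i + j] += a_i * b_j
    return product


# Generating function of one fair die: (s + s^2 + ... + s^6) / 6.
die = [Fraction(0)] + [Fraction(1, 6)] * 6

# Generating function of the sum of three independent dice.
total = poly_mul(poly_mul(die, die), die)

# total[k] is P(S = k) for k = 0, ..., 18.
print(total[3], total[4], total[5])
```

The coefficient list has length 19, with nonzero entries only for k = 3, ..., 18.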


Exercise 5.4.1. 


Write Var(X + Y) as an expectation in terms of $X - \mu_X$ and $Y - \mu_Y$, and simplify. 


Exercise 5.4.3. 


1. Write Cov(U, V) in terms of $\tilde{X} = X - \mu_X$ and $\tilde{Y} = Y - \mu_Y$, and simplify. 
2. Compute, for instance, P(V = 0|U = 2) and P(V = 0), and use Theorem 4.6.3. 


Exercise 5.4.5. 


$\mathrm{Cov}(X, Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} x_i y_j - \sum_{i=1}^{m} p_i x_i \sum_{j=1}^{n} q_j y_j$. 


Exercise 5.4.7. 


First show that Cov(U, V) = ac Cov(X, Y), $\sigma_U = |a|\sigma_X$, and $\sigma_V = |c|\sigma_Y$. 


Exercise 5.4.9. 


Use the result of Exercise 5.4.1 and Theorem 5.2.2. 


Exercise 5.5.1. 


Use Definition 5.5.1, for discrete X and Y, and Theorem 5.1.3 with g(Y) = Ey(X). 


Exercise 5.5.3. 


E(X) =9/4. 
Exercise 5.5.5. 

$E_y(X) = (1 - y)/2$ if 0 < y < 1 and $E_y(X) = 0$ otherwise. $E_x(Y) = (1 - x)/2$ if 
0 < x < 1 and $E_x(Y) = 0$ otherwise. E(X) = E(Y) = 1/3. 


Exercise 5.5.7. 


*. POLe<2 leo 1 
E,(X) = 42 nd: Baga lg BS 
0 


otherwise 0 otherwise. 
Exercise 5.5.9. 


Use Definition 5.5.1 and Theorem 5.1.4. 


Exercise 5.5.11. 


Use Theorem 5.5.1 and Definition 5.5.2. 


Exercise 5.5.13. 


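Show that for continuous (X, Y) with density f(x, y), $E(\mathrm{Var}_Y(X)) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} [x - 
E_y(X)]^2 f(x, y)\,dx\,dy$ and $\mathrm{Var}(X) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} [x - E(X)]^2 f(x, y)\,dx\,dy$, and since 
$E_y(X) \ne E(X)$ in general, also $\mathrm{Var}(X) \ne E(\mathrm{Var}_Y(X))$ in general. 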


Exercise 5.6.1. 


1. Let n = 2k + 1 for k = 1, 2, .... Then the median is $m = x_{k+1}$. 


2. Let n = 2k for k = 1, 2, .... Then any number m such that $x_k \le m \le x_{k+1}$ is a 
median. 


Exercise 5.6.3. 


The converse of Theorem 5.6.1 says: For m a median of a random variable X, P(X < 
m) = 1/2 and P(X > m) = 1/2 imply P(X = m) = 0. Show that this statement is 
true. 


Exercise 5.6.5. 


Write E(|X — c|) as two integrals without absolute values and differentiate with 
respect to c using the Fundamental Theorem of Calculus. Thus show that E(|X − c|) 
has a critical point where 2F(c) — 1 = 0, or where F(c) = 1/2, that is, if c is a 
median m. Since we assumed that f is continuous and f(x) > 0, m is unique. Use 
the second derivative test to show that E(|X — c|) has a minimum at c = m. 
Exercise 5.6.7. 
For general X the 50th percentile is defined as the number $x_{.5} = \min\{x : F(x) \ge 
0.5\}$. Show that this $x_{.5}$ satisfies the two conditions in the definition of the median. 
Exercise 5.6.9. 


The quantile function is $F^{-1}(p) = 2\sqrt{p} - 1$ for $p \in (0, 1)$. 


Exercise 5.6.11. 


Invert p = x/4 and p = x/4 + 1/2 separately. The graph of $F^{-1}(p)$ is the reflection 
of the graph in Fig. 4.8 across the y = x line. 


Exercise 6.1.1. 


1. $P(X(1) > 1) \approx 0.264$. 
2. $P(X(2) > 2) \approx 0.323$. 

3. $P(X_1(1) > 1 \text{ and } X_2(1) > 1) \approx 0.0698$. 

4. $P(X_1(1) = 2 \text{ and } X_2(1) = 2 \mid X(2) = 4) \approx 0.375$. 
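
These four values are consistent with Poisson counts, as the following sketch shows. It assumes a rate of 1 arrival per unit time, with $X_1(1)$ and $X_2(1)$ independent Poisson counts of mean 1 whose sum is the mean-2 count in the condition; these assumptions are inferred from the printed answers, not quoted from the exercise.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson random variable with the given mean."""
    return exp(-mean) * mean ** k / factorial(k)


def poisson_tail(k, mean):
    """P(X > k) for a Poisson random variable with the given mean."""
    return 1.0 - sum(poisson_pmf(i, mean) for i in range(k + 1))


p1 = poisson_tail(1, 1.0)                              # P(X(1) > 1)
p2 = poisson_tail(2, 2.0)                              # P(X(2) > 2)
p3 = p1 ** 2                                           # two independent counts
p4 = poisson_pmf(2, 1.0) ** 2 / poisson_pmf(4, 2.0)    # conditional probability
```

The last value reduces to the binomial probability $\binom{4}{2}/2^4$, since the exponential factors cancel in the ratio.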


Exercise 6.1.3. 


1. $P(X(1) > 2) \approx 0.323$. 
2. $P(X(2) > 4) \approx 0.371$. 
3. $P(X(1) > 4) \approx 0.053$. 
4. $P(T_1 > 1) \approx 0.135$. 

5. $P(T_2 > 1/2) \approx 0.135$. 


Exercise 6.1.5. 


$P(\text{even}) - P(\text{odd}) = e^{-2\lambda t}$. Also, P(even) + P(odd) = 1. 


Exercise 6.1.7. 


Consider the instants $s - \Delta s < s < t < t + \Delta t < s' - \Delta s' < s' < t' < t' + \Delta t'$ and 
let $T_1$ and $T_2$ denote two distinct interarrival times. Compute $f_{T_1,T_2}(t - s, t' - s')$ as 
a limit and, in the last step, use part 2 of Theorem 6.1.7. If t = s′, the proof would 
be similar. 
Exercise 6.2.1. 


Using the table, we obtain 
1. P(Z < 2) ≈ 0.9772, 
2. P(Z > 2) ≈ 0.0228, 
3. P(Z = 2) = 0, 
4. P(Z < −2) ≈ 0.0228, 
5. P(−2 < Z < 2) ≈ 0.9544, 
6. P(|Z| > 2) ≈ 0.0456, 
7. P(−2 < Z < 1) ≈ 0.8185, 
8. z ≈ 1.6448, 
9. z ≈ 1.6448, 
10. z ≈ 1.2815. 
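
Table look-ups of this kind can be cross-checked with `statistics.NormalDist`. This is only a sketch: the quantile levels 0.95 and 0.90 used for the last two values are inferred from the answers 1.6448 and 1.2815, and small differences in the fourth decimal come from rounding in the printed table.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mu = 0, sigma = 1

p_less_2 = Z.cdf(2)                   # P(Z < 2)
p_greater_2 = 1 - Z.cdf(2)            # P(Z > 2), also P(Z < -2) by symmetry
p_within_2 = Z.cdf(2) - Z.cdf(-2)     # P(-2 < Z < 2)
p_outside_2 = 1 - p_within_2          # P(|Z| > 2)
p_minus2_to_1 = Z.cdf(1) - Z.cdf(-2)  # P(-2 < Z < 1)
z_95 = Z.inv_cdf(0.95)                # assumed level for z = 1.6448
z_90 = Z.inv_cdf(0.90)                # assumed level for z = 1.2815
```

Since Z is continuous, P(Z = 2) = 0 needs no computation.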


Exercise 6.2.3. 


1. Differentiate $\varphi(z) = (1/\sqrt{2\pi})e^{-z^2/2}$ twice and show that $\varphi''(z)$ changes sign at 
$z^2 = 1$, that is, at $z = \pm 1$. 


2. Differentiate $f(x) = (1/\sqrt{2\pi}\sigma)e^{-(x-\mu)^2/2\sigma^2}$ twice and show that $f''(x)$ changes 
sign at $((x - \mu)/\sigma)^2 = 1$, that is, at $(x - \mu)/\sigma = \pm 1$. 


Exercise 6.2.7. 


a) Compare $ce^{-(x+2)^2/24}$ with the general normal p.d.f. $(1/\sqrt{2\pi}\sigma)e^{-(x-\mu)^2/2\sigma^2}$. 


Exercise 6.2.9. 


PX = Xo) 3) 20.27. 


Exercise 6.2.11. 


Solve $z = \Phi^{-1}(1 - p)$ for p, to get the area of the tail to the right of z under the 
standard normal curve. Switch to the area of the corresponding left tail and solve the 
resulting equation for z. 


Exercise 6.3.1. 


With the binomial: P(X = 3) ≈ 0.238. With the normal approximation: P(2.5 < 
X < 3.5) ≈ 0.2312. 
Exercise 6.3.3. 


A single random number X is a uniform random variable with $\mu = 1/2$ and 
$\sigma^2 = 1/12$. By Corollary 6.3.2, the average $\bar{X}$ of n = 100 i.i.d. copies of X is 
approximately normal with $\mu_{\bar{X}} = 1/2$ and $\sigma_{\bar{X}} = \sqrt{1/1200} \approx 0.029$. Thus, $P(0.49 < 
\bar{X} < 0.51) = P((0.49 - 0.5)/0.029 < (\bar{X} - 0.5)/0.029 < (0.51 - 0.5)/0.029) \approx 
P(-0.345 < Z < 0.345) = 2\Phi(0.345) - 1 \approx 0.27$. 
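
The same normal approximation can be carried out without a table. A minimal sketch, using `statistics.NormalDist` for the approximating distribution of the sample mean:

```python
from math import sqrt
from statistics import NormalDist

n = 100
mu = 0.5                       # mean of a uniform random number on (0, 1)
sigma_mean = sqrt(1 / 12 / n)  # standard deviation of the average of n copies

# Approximate distribution of the sample mean by the central limit theorem.
xbar = NormalDist(mu=mu, sigma=sigma_mean)
prob = xbar.cdf(0.51) - xbar.cdf(0.49)
print(round(sigma_mean, 3), round(prob, 2))
```

The unrounded standard deviation gives a standardized bound slightly above 0.345, which does not change the probability at two decimals.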


Exercise 6.3.5. 


1. n > 144. 
2. n > 390. 


Exercise 6.4.1. 


P(r successes before s failures) = )y 1%! (se gt 


Exercise 6.4.5. 


Use P(X, = k, Xray = 1) = P(X; = K)P(X, =1—B). 


Exercise 6.4.7. 


Differentiate the gamma density from Definition 6.4.2. 


Exercise 6.4.9. 

1. Modify the proof in Example 6.4.2. 

2. Use Part 1 and $\mathrm{Var}(T) = E(T^2) - [E(T)]^2$. 

3. Use the definition of $\psi(t)$ and again modify the proof in Example 6.4.2. 
Exercise 6.4.11. 


Use mathematical induction on k. 


Exercise 6.4.13. 


Compute $F_U(u)$ first, then differentiate. U turns out to be exponential with parameter 
$\lambda = 1/2\sigma^2$. In particular, the $\chi^2_2$ distribution is the same as the exponential with 
parameter 1/2. 
Exercise 6.4.15. 
Use Theorem 4.5.8. 


Exercise 6.4.19. 


Use Theorem 4.6.2, Equation 4.142 but, instead of evaluating the integral in the 
denominator, determine the appropriate coefficient for the numerator by noting 
that it being a power of p times a power of 1 − p, the posterior density $f_{P|X}$ 
must be beta. Thus, $f_{P|X}$ is beta with parameters k + r and n − k + s, and 
c = 1/B(k + r, n − k + s). 


Exercise 6.5.1. 

Clearly, Y; and Y2, as linear combinations of normals, are normal. To show that they 
are standard normal, compute their expectations and variances. 

Exercise 6.5.3. 

Equate the coefficients of like powers in the exponents in Equation 6.147 and in the 
present problem. 

Exercise 6.5.5. 


Use the result of Exercise 5.4.8 and Theorems 5.4.2, 6.5.1 and 6.5.4. 


Exercise 6.5.7. 


First show that $X_2$ under the condition $X_1 = 80$ is normal with $\mu \approx 77$ and $\sigma \approx 
8.57$. Hence $x_{.99} \approx 88$. 


Exercise 6.5.9. 


If $(X_1, X_2)$ is bivariate normal as given by Definition 6.5.1, then $aX_1 + bX_2$ is a 
linear combination of the independent normals $Z_1$ and $Z_2$, plus a constant, and so 
Theorems 6.2.4 and 6.2.6 show that it is normal. 

To prove the converse, assume that all linear combinations of $X_1$ and $X_2$ are normal, 
and choose two linear combinations, $T_1 = a_1X_1 + b_1X_2$ and $T_2 = a_2X_1 + b_2X_2$ 
such that $\mathrm{Cov}(T_1, T_2) = 0$. Such a choice is always possible, since if $\mathrm{Cov}(X_1, X_2) = 
0$, then $T_1 = X_1$ and $T_2 = X_2$ will do, and otherwise the rotation from Exercise 6.5.5 
achieves it. Next, proceed as in the proof of Theorem 6.5.1. 


Exercise 6.5.11. 


Hy, = 0, uu, = 5,012 = 48,05, = 44.2,09, = 5.8, oyu, = —7, and 
PU,,U, ~ —0.437. 
Exercise 7.1.5. 


(a) Differentiate $\ln(L(\lambda)) = n \ln \lambda + (\lambda - 1) \sum \ln x_i$. 
(b) Express $\lambda$ as a function of E(X) and replace E(X) by $\bar{X}_n$. 


Exercise 7.1.7. 


Use Theorem 4.5.8 to find fy (y) and the latter to compute E (©). 


Exercise 7.1.9. 


$\hat{\sigma} \approx 0.022$ and the required approximate confidence intervals are (53.4%, 60.6%), 
(52.7%, 61.3%), and (51.3%, 62.7%). 
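
The three intervals can be reproduced from the standard error. In this sketch the point estimate 0.57 and the confidence levels 90%, 95% and 99% are assumptions inferred from the midpoints and widths of the printed intervals.

```python
from statistics import NormalDist

p_hat = 0.57           # assumed sample proportion (midpoint of the intervals)
standard_error = 0.022


def confidence_interval(level):
    """Large-sample interval p_hat +/- z * standard_error for the given level."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (p_hat - z * standard_error, p_hat + z * standard_error)


intervals = {level: confidence_interval(level) for level in (0.90, 0.95, 0.99)}
for level, (low, high) in intervals.items():
    print(f"{level:.0%}: ({low:.1%}, {high:.1%})")
```

The interval widens as the confidence level increases, which is the trade-off the exercise illustrates.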


Exercise 7.2.1. 


The P-value is about 0.294. This probability is high enough for us to accept the null 
hypothesis, that is, that the low average of this class is due to chance; these students 
may well come from a population with mean grade 66. 


Exercise 7.2.3. 


Using a large-sample paired Z-test for the mean increase $\mu = \mu_2 - \mu_1$ of the weights, 
we find the approximate P-value to be 0.0002. Thus, we reject the null hypothesis: 
the diet is very likely to be effective; however, the improvement is slight and the 
decision might hinge on other factors, like the price and availability of the new diet. 


Exercise 7.3.1. 


b) If $\mu = 6.5$, then the drug has really reduced the duration of the cold from 7 to 6.5 
days, and the test will correctly show with probability 0.841 that the drug works. 


Exercise 7.3.3. 


The rejection region is $(-\infty, 26.5]$. The power function is given by $\pi(\mu) = P(\bar{X} \in 
C|\mu) = P(\bar{X} \le 26.5|\mu) \approx \Phi((26.5 - \mu)/24)$. 


Exercise 7.3.5. 


Let X denote the number of nondefective chips. The rejection region is the set of 
integers C = {0,1,2,... , 10}. The operating characteristic function is 1 — m(p) = 
P(X € Clp) = (2) p22 — p) + (4) pa — p)! 

(X €Clp) = (5) p70 — p)? + Gp — py. 




Exercise 7.4.1. 


$862.26 < \mu < 1035.74$ and $41.8 < \sigma < 201.4$. 


Exercise 7.4.3. 


The P-value is P(T > 0.995) ≈ 0.2, and so we accept the null hypothesis, the truth 
of the store’s claim. 


Exercise 7.4.5. 


Use the limit formula $\lim_{k \to \infty}(1 + x/k)^k = e^x$. 


Exercise 7.5.5. 

Divide the interval into four equal parts (in order to have the expected numbers be at 
least five) and compute $\chi^2$. 

Exercise 7.5.7. 


Extend the given table to include the marginal frequencies. Hence, the expected 
frequencies under the assumption of independence can be obtained by multiplying each 
row frequency with each column frequency and dividing by 88. Compute $\chi^2$ and the 
number of degrees of freedom and use a $\chi^2$ table. 
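
The recipe in this hint can be sketched in code. The observed table below is hypothetical (it is not the table of Exercise 7.5.7, whose total is 88), and the function names are illustrative.

```python
def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_frequencies(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


observed = [[10, 20], [30, 40]]  # hypothetical 2 x 2 table of counts
expected = expected_frequencies(observed)
statistic = chi_square_statistic(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
```

The statistic is then compared with the appropriate percentile of Table 3 for the computed number of degrees of freedom.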


Exercise 7.6.1. 


The P-value is about 0.0026. 


Exercise 7.6.3. 
By the definition of F,,, and the independence of the chi-square variables involved, 
Use the definition of the $\Gamma$ function in the evaluation of $E(1/\chi^2)$. 


Exercise 7.7.1. 


Use Equation 7.152. 


Exercise 7.7.5. 


Use Equation 7.154. 